Preface


The term chemometrics was proposed more than 20 years ago to describe the 
techniques and operations associated with the mathematical manipulation and 
interpretation of chemical data. It is within the past 10 years, however, that 
chemometrics has come to the fore, and become generally recognized as a 
subject to be studied and researched by all chemists employing numerical data. 
This is particularly true in analytical science. In a modern instrumentation 
laboratory, the analytical chemist may be faced with a seemingly overwhelming 
amount of numerical and graphical data. The identification, classification and 
interpretation of these data can be a limiting factor in the efficient and effective 
operation of the laboratory. Increasingly, sophisticated analytical instru- 
mentation is also being employed out of the laboratory, for direct on-line or 
in-line process monitoring. This trend places severe demands on data manipu- 
lation, and can benefit from computerized decision making. 

Chemometrics is complementary to laboratory automation. Just as auto- 
mation is largely concerned with the tools with which to handle the mechanics 
and chemistry of laboratory manipulations and processes, so chemometrics 
seeks to apply mathematical and statistical operations to aid data handling. 

This book aims to provide students and practising spectroscopists with an 
introduction and guide to the application of selected chemometric techniques 
used in processing and interpreting analytical data. Chapter 1 covers the basic 
elements of univariate and multivariate data analysis, with particular emphasis 
on the normal distribution. The acquisition of digital data and signal enhance- 
ment by filtering and smoothing are discussed in Chapter 2. These processes are 
fundamental to data analysis but are often neglected in chemometrics research 
texts. Having acquired data, it is often necessary to process them prior to 
analysis. Feature selection and extraction are reviewed in Chapter 3; the main 
emphasis is on deriving information from data by forming linear combinations 
of measured variables, particularly principal components. Pattern recognition 
comprises a wide variety of chemometric and multivariate statistical techniques 
and the most common algorithms are described in Chapters 4 and 5. In Chapter 
4, exploratory data analysis by clustering is discussed, whilst Chapter 5 is 
concerned with classification and discriminant analysis. Multivariate cali- 
bration techniques have become increasingly popular and Chapter 6 provides a 
summary and examples of the more common algorithms in use. Finally, an
Appendix is included which aims to serve as an introduction or refresher in 
matrix algebra. 

A conscious decision has been made not to provide computer programs of
the algorithms discussed. In recent years, the range and quality of software 
available commercially for desktop, personal computers has improved dramati- 
cally. Statistical software packages with excellent graphic display facilities are 
available from many sources. In addition, modern mathematical software tools 
allow the user to develop and experiment with algorithms without the problems 
associated with developing machine specific input/output routines or high 
resolution graphic interfaces. 

The text is not intended to be an exhaustive review of chemometrics in 
spectroscopic analysis. It aims to provide the reader with sufficient detail of 
fundamental techniques to encourage further study and exploration, and aid in 
dispelling the ‘black-box’ attitude to much of the software currently employed 
in instrumental analytical analysis. 
Descriptive Statistics


1 Introduction 


The mathematical manipulation of experimental data is a basic operation 
associated with all modern instrumental analytical techniques. Computeri- 
zation is ubiquitous and the range of computer software available to spectro- 
scopists can appear overwhelming. Whether the final result is the determination 
of the composition of a sample or the qualitative identification of some species 
present, it is necessary for analysts to appreciate how their data are obtained 
and how they can be subsequently modified and transformed to generate the 
required information. A good starting point in this understanding is the study 
of the elements of statistics pertaining to measurement and errors.¹⁻³ Whilst
there is no shortage of excellent books on statistics and their applications in 
spectroscopic analysis, no apology is necessary here for the basics to be 
reviewed. 

Even in those cases where an analysis is qualitative, quantitative measures 
are employed in the processes associated with signal acquisition, data extrac- 
tion, and data processing. The comparison of, say, a sample's infrared spectrum 
with a set of standard spectra contained in a pre-recorded database involves 
some quantitative measure of similarity in order to find and identify the best 
match. Differences in spectrometer performance, sample preparation methods, 
and the variability in sample composition due to impurities will all serve to 
make an exact match extremely unlikely. In quantitative analysis the variability 
in results may be even more evident. Within-laboratory tests amongst staff and 
inter-laboratory round-robin exercises often demonstrate the far from perfect 
nature of practical quantitative analysis. These experiments serve to confirm 
the need for analysts to appreciate the source of observed differences and to 
understand how such errors can be treated to obtain meaningful conclusions 
from the analysis. 

Quantitative analytical measurements are always subject to some degree of 


1 C. Chatfield, ‘Statistics for Technology’, Chapman and Hall, London, UK, 1976. 

2 P.R. Bevington, ‘Data Reduction and Error Analysis for the Physical Sciences’, McGraw-Hill,
New York, USA, 1969. 

3 J.C. Miller and J.N. Miller, ‘Statistics for Analytical Chemistry’, Ellis Horwood, Chichester, UK, 
1993. 
error. No matter how much care is taken, or how stringent the precautions
followed to minimize the effects of gross errors from sample contamination or 
systematic errors from poor instrument calibration, random errors will always 
exist. In practice this means that although a quantitative measure of any 
variable, be it mass, concentration, absorbance value, etc., may be assumed to 
approximate the unknown true value, it is unlikely to be exactly equal to it. 
Repeated measurement of the same variable on similar samples will not only 
provide discrepancies between the observed results and the true value, but there 
will be differences between the measurements themselves. This variability can 
be ascribed to the presence of random errors associated with the measurement 
process, e.g. instrument generated noise, as well as the natural, random vari- 
ation in any sample's characteristics and composition. As more samples are 
analysed or more measurements are repeated then a pattern to the inherent 
scatter of the data will emerge. Some values will be observed to be too high and 
some too low compared with the correct result, if this is known. In the absence 
of any bias or systematic error the results will be distributed evenly about the 
true value. If the analytical process and repeating measurement exercise could 
be undertaken indefinitely, then the true underlying distribution of the data 
about the correct or expected value would be obtained. In practice, of course, 
this complete exercise is not possible. It is necessary to hypothesize about the 
scatter of observed results and assume the presence of some underlying pre- 
dictable and well characterized parent distribution. The most common assump- 
tion is that the data are distributed normally. 


2 Normal Distribution 


The majority of statistical tests, and those most widely employed in analytical
science, assume that observed data follow a normal distribution. The normal, 
sometimes referred to as Gaussian, distribution function is the most important 
distribution for continuous data because of its wide range of practical applica- 
tion. Most measurements of physical characteristics, with their associated 
random errors and natural variations, can be approximated by the normal 
distribution. The well known shape of this function is illustrated in Figure 1. As
shown, it is referred to as the normal probability curve.? The mathematical 
model describing the normal distribution function with a single measured 
variable, x, is given by Equation (1). 

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(x-\mu)^2}{2\sigma^2}\right] \qquad (1)$$


The height of the curve at some value of x is denoted by f(x), while μ and σ are
characteristic parameters of the function. The curve is symmetric about μ, the
mean or average value, and the spread about this value is given by the variance,
σ², or standard deviation, σ. It is common for the curve to be standardized so
that the area enclosed is equal to unity, in which case the area under f(x) between
two values of x provides the probability of observing a value within that range. With
Figure 1 Standardized normal probability curve and characteristic parameters, the mean
and standard deviation 


reference to Figure 1, one-half of observed results can be expected to lie above
the mean and one-half below μ. Whatever the values of μ and σ, about one
result in three will be expected to be more than one standard deviation from the
mean, about one in twenty will be more than two standard deviations from the
mean, and less than one in 300 will be more than 3σ from μ.

Equation (1) describes the idealized distribution function, obtained from an 
infinite number of sample measurements, the so-called parent population distri- 
bution. In practice we are limited to some finite number, n, of samples taken 
from the population being examined and the statistics, or estimates, of mean, 
variance, and standard deviation are denoted then by x̄, s², and s respectively.
The mathematical definitions for these parameters are given by Equations (2) to (4),

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad (2)$$

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (3)$$

$$s = \sqrt{s^2} \qquad (4)$$

where the subscript i (i = 1 ... n) denotes the individual elements of the set of
data.
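
The definitions of mean, variance, and standard deviation can be sketched in a few lines of Python. The example below is only an illustration: the function name `describe` is hypothetical, and the data are the 40 sodium results transcribed from Table 1, read column by column. It uses the n − 1 divisor for the variance, which reproduces the group variances quoted in Table 1.

```python
import math

# Sodium results (mg/kg) transcribed from Table 1, read column by column.
sodium = [
    10.8, 11.1, 10.6, 10.9, 11.5, 11.2, 10.5, 11.8,
    10.4, 12.2, 11.6, 10.2, 10.6, 12.4, 11.6, 12.3,
    11.7, 11.3, 10.2, 10.3, 10.5, 12.4, 10.3, 10.1,
    10.6, 11.5, 11.2, 10.2, 10.2, 10.4, 10.5, 12.2,
    12.2, 10.2, 10.6, 10.3, 10.1, 12.5, 11.6, 10.8,
]


def describe(values):
    """Return (mean, variance, standard deviation) using an n - 1 divisor."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return mean, variance, math.sqrt(variance)


mean, variance, std_dev = describe(sodium)
print(f"mean = {mean:.2f}, s^2 = {variance:.2f}, s = {std_dev:.2f}")
```

Rounded to two decimal places, the results agree with the totals listed in Table 1 (11.04, 0.60, and 0.78).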

A simple example serves to illustrate the use of these statistics in reducing 
data to key statistical values. Table 1 gives one day's typical laboratory results 
for 40 mineral water samples analysed for sodium content by flame photometry. 
In analytical science it is common practice for such a list of replicated analyses 
to be reduced to these descriptive statistics. Despite their widespread use and
analysts’ familiarity with these elementary statistics, care must be taken with
Table 1 The sodium content (mg kg⁻¹) of bottled mineral water as determined
by flame photometry

Sodium (mg kg⁻¹)

               10.8     10.4     11.7     10.6     12.2
               11.1     12.2     11.3     11.5     10.2
               10.6     11.6     10.2     11.2     10.6
               10.9     10.2     10.3     10.2     10.3
               11.5     10.6     10.5     10.2     10.1
               11.2     12.4     12.4     10.4     12.5
               10.5     11.6     10.3     10.5     11.6
               11.8     12.3     10.1     12.2     10.8

Group means:   11.1     11.4     10.8     10.8     11.0
Group s²:      0.197    0.801    0.720    0.514    0.883
Group s:       0.444    0.895    0.848    0.717    0.939

Total mean = 11.04 mg kg⁻¹
s² = 0.60 mg² kg⁻²
s = 0.78 mg kg⁻¹
%RSD = 7.03%

Table 2 Concentration of chromium and nickel (mg kg⁻¹), determined by AAS, in samples
taken from four sources of waste waters

Source         A               B               C               D
           Cr     Ni       Cr     Ni       Cr     Ni       Cr     Ni

           10    8.04      10    9.14      10    7.46       8    6.58
            8    6.95       8    8.14       8    6.77       8    5.76
           13    7.58      13    8.74      13   12.74       8    7.71
            9    8.81       9    8.77       9    7.11       8    8.84
           11    8.33      11    9.26      11    7.81       8    8.47
           14    9.96      14    8.10      14    8.84       8    7.04
            6    7.24       6    6.13       6    6.08       8    5.25
            4    4.26       4    3.10       4    5.39      19   12.5
           12   10.84      12    9.13      12    8.15       8    5.56
            7    4.82       7    7.26       7    6.42       8    7.91
            5    5.68       5    4.74       5    5.74       8    6.90

Mean        9    7.50       9    7.50       9    7.50       9    7.50
s        3.16    1.94    3.16    1.94    3.16    1.94    3.16    1.94
r            0.82            0.82            0.82            0.82


their application and interpretation; in particular, what underlying assump- 
tions have been made. In Table 2 is a somewhat extreme but illustrative set of 
data. Chromium and nickel concentrations have been determined in waste 
water supplies from four different sources (A, B, C and D). In all cases the mean 
concentration and standard deviation for each element is similar, but careful 
examination of the original data shows major differences in the results and
element distribution. These data will be examined in detail later. The practical 
significance of reducing the original data to summary statistics is questionable 
and may serve only to hide rather than extract information. As a general rule, it 
is always a good idea to examine data carefully before and after any trans- 
formation or manipulation to check for absurdities and loss of information. 
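
The point can be checked numerically. The sketch below, using values transcribed from Table 2, computes the mean, standard deviation, and correlation coefficient for each source. The helper names are hypothetical, and the standard deviation is computed with n as the divisor, an assumption made here because that form appears to reproduce the tabulated values of 3.16 and 1.94.

```python
import math

# Cr and Ni concentrations transcribed from Table 2 (sources A-C share Cr values).
cr_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
table2 = {
    "A": (cr_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (cr_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (cr_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.74]),
    "D": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.90]),
}


def mean(values):
    return sum(values) / len(values)


def pop_std(values):
    """Standard deviation with n as divisor (assumed form of the tabulated s)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def pearson_r(x, y):
    """Product-moment correlation coefficient between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


summary = {
    source: (mean(cr), mean(ni), pop_std(cr), pop_std(ni), pearson_r(cr, ni))
    for source, (cr, ni) in table2.items()
}
for source, stats in summary.items():
    print(source, " ".join(f"{value:.2f}" for value in stats))
```

The four sources give practically the same summary line even though the underlying values differ markedly, which is why plotting or inspecting the raw data remains worthwhile.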

Whilst both variance and standard deviation attempt to describe the width of 
the distribution profile of the data about a mean value, the standard deviation is 
often favoured over variance in laboratory reports as s is expressed in the same 
units as the original measurements. Even so, the significance of a standard 
deviation value is not always immediately apparent from a single set of data. 
Obviously a large standard deviation indicates that the data are scattered 
widely about the mean value and, conversely, a small standard deviation is 
characteristic of a more tightly grouped set of data. The terms ‘large’ and 
‘small’ as applied to standard deviation values are somewhat subjective, 
however, and from a single value for s it is not immediately apparent just how 
extensive the scatter of values is about the mean. Thus, although standard 
deviation values are useful for comparing sets of data, a further derived 
function, usually referred to as the relative standard deviation, RSD, or co- 
efficient of variation, CV, is often used to express the distribution and spread of 
data. 


$$\mathrm{CV} = \%\mathrm{RSD} = \frac{100\,s}{\bar{x}} \qquad (5)$$


If sets or groups of data of equal size are taken from the parent population 
then the mean of each group will vary from group to group and these mean 
values form the sampling distribution of x̄. As an example, if the analytical
results provided in Table 1 are divided into five groups, each of eight results,
then the group mean values are 11.05, 11.41, 10.85, 10.85, and 11.04 mg kg⁻¹.
The mean of these values is still 11.04, but the standard deviation of the group
means is 0.23 mg kg⁻¹ compared with 0.78 mg kg⁻¹ for the original 40
observations. The group means are less widely scattered about the mean than the
original data (Figure 2). The standard deviation of group mean values is
referred to as the standard error of the sample mean, $\sigma_m$, and is calculated from


$$\sigma_m = \frac{\sigma_p}{\sqrt{n}} \qquad (6)$$


where $\sigma_p$ is the standard deviation of the parent population and n is the number
of observations in each group. It is evident from Equation (6) that the more
observations taken, then the smaller the standard error of the mean and the
more precise the estimate of the mean. This distribution of sampled mean values
provides the basis for an important concept in statistics. If random samples of 
group size n are taken from a normal distribution then the distribution of the
sample means will also be normal. Furthermore, and this is not intuitively 
obvious, even if the parent distribution is not normal, providing large sample 
sizes (n > 30) are taken then the sampling distribution of the group means will 
Figure 2 Group means for the data from Table 1 have a lower standard deviation than the
original data 


still approximate the normal curve. Statistical tests based on an assumed 
normal distribution can therefore be applied to essentially non-normal data. 
This result is known as the central limit theorem and serves to emphasize the 
importance and applicability of the normal distribution function in statistical 
data analysis since non-normal data can be normalized and can be subject to 
statistical analysis." 
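
As a rough numerical check on the standard error, the sketch below compares the observed scatter of the five group means from Table 1 with the value predicted by dividing a standard deviation by the square root of the group size. Because the parent standard deviation is unknown, the sample standard deviation of all 40 results is used in its place; that substitution is an assumption of this example.

```python
import math
import statistics

# The five groups of eight sodium results (mg/kg), i.e. the columns of Table 1.
groups = [
    [10.8, 11.1, 10.6, 10.9, 11.5, 11.2, 10.5, 11.8],
    [10.4, 12.2, 11.6, 10.2, 10.6, 12.4, 11.6, 12.3],
    [11.7, 11.3, 10.2, 10.3, 10.5, 12.4, 10.3, 10.1],
    [10.6, 11.5, 11.2, 10.2, 10.2, 10.4, 10.5, 12.2],
    [12.2, 10.2, 10.6, 10.3, 10.1, 12.5, 11.6, 10.8],
]

group_means = [statistics.mean(g) for g in groups]
all_values = [x for g in groups for x in g]

# Observed scatter of the group means.
s_means = statistics.stdev(group_means)

# Predicted standard error, with s standing in for the parent sigma.
std_error = statistics.stdev(all_values) / math.sqrt(len(groups[0]))

print(f"observed s of group means = {s_means:.2f} mg/kg")
print(f"predicted standard error  = {std_error:.2f} mg/kg")
```

With only five groups the observed and predicted values are not expected to coincide, but both are much smaller than the standard deviation of the individual results.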


Significance Tests 


Having introduced the normal distribution and discussed its basic properties, 
we can move on to the common statistical tests for comparing sets of data. 
These methods and the calculations performed are referred to as significance 
tests. An important feature and use of the normal distribution function is that it 
enables areas under the curve, within any specified range, to be accurately 
calculated. The function in Equation (1) is integrated numerically and the 
results presented in statistical tables as areas under the normal curve. From 
these tables, approximately 68% of observations can be expected to lie in the
region bounded by one standard deviation from the mean, 95% within μ ± 2σ,
and more than 99% within μ ± 3σ.
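
These tabulated areas can be reproduced without tables. The following sketch relies on the standard identity that the fraction of a normal population within k standard deviations of the mean equals erf(k/√2); the function name `area_within` is hypothetical.

```python
import math


def area_within(k):
    """Fraction of a normal population within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


for k in (1, 2, 3):
    print(f"within {k} sigma: {100 * area_within(k):.2f}%")
```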

We can return to the data presented in Table 1 for the analysis of the mineral
water. If the parent population parameters, σ and μ₀, are known to be
0.82 mg kg⁻¹ and 10.8 mg kg⁻¹ respectively, then we can ask whether
the analytical results given in Table 1 are likely to have come from a
water sample with a mean sodium level similar to that providing the parent
data. In statistical terminology, we wish to test the null hypothesis that the
means of the sample and the suggested parent population are similar. This is
generally written as
$$H_0: \bar{x} = \mu_0 \qquad (7)$$


i.e. there is no difference between x̄ and μ₀ other than that due to random
variation. The lower the probability that the difference occurs by chance, the
less likely it is that the null hypothesis is true. In order for us to make the
decision whether to accept or reject the null hypothesis, we must declare a value
for the chance of making the wrong decision. If there is less than a 1
in 20 chance of the difference being due to random factors, the difference is
significant at the 5% level (usually written as α = 5%). We are willing to accept
a 5% risk of rejecting the conclusion that the observations are from the same
source as the parent data if they are in fact similar.
The test statistic for such an analysis is denoted by z and is given by 


$$z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}} \qquad (8)$$


x̄ is 11.04 mg kg⁻¹, as determined above, and substituting into Equation (8)
values for μ₀ and σ then


$$z = \frac{11.04 - 10.80}{0.82/\sqrt{40}} = 1.85 \qquad (9)$$

The extreme regions of the normal curve containing 5% of the area are 
illustrated in Figure 3 and the values can be obtained from statistical tables. 
The selected portion of the curve, dictated by our limit of significance, is 
referred to as the critical region. If the value of the test statistic falls within this 
area then the hypothesis is rejected and there is no evidence to suggest that the 
samples come from the parent source. From statistical tables, 2.5% of the area is
below −1.96σ and 2.5% is above +1.96σ. The calculated value for z of 1.85
does not exceed the tabulated z-value of 1.96 and the conclusion is that the 
mean sodium concentrations of the analysed samples and the known parent 
sample are not significantly different. 
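
The z-test just described can be sketched as follows. The function names are hypothetical; the two-tailed probability is obtained from the complementary error function instead of statistical tables, and the critical value of 1.96 is the one quoted in the text.

```python
import math


def z_statistic(sample_mean, mu0, sigma, n):
    """Test statistic for a sample mean against a known parent mean and sigma."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))


def two_tailed_p(z):
    """Probability of a standard normal variate exceeding |z| in either tail."""
    return math.erfc(abs(z) / math.sqrt(2.0))


# Values from the mineral water example: 40 results, mean 11.04 mg/kg.
z = z_statistic(11.04, 10.8, 0.82, 40)
p = two_tailed_p(z)
reject = abs(z) > 1.96  # critical value for a two-tailed test at the 5% level

print(f"z = {z:.2f}, p = {p:.3f}, reject null hypothesis: {reject}")
```

A probability above 0.05 corresponds to a z-value inside ±1.96, so the two ways of stating the decision are equivalent.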

In the above example it was assumed that the mean value and standard 
deviation of the sodium concentration in the parent sample were known. In 
practice this is rarely possible as all the mineral water from the source would 
not have been analysed and the best that can be achieved is to obtain recorded 
estimates of μ and σ from repetitive sampling. Both the recorded mean value
and the standard deviation will undoubtedly vary and there will be a degree of 
uncertainty in the precise shape of the parent normal distribution curve. This 
uncertainty, arising from the use of sampled data, can be compensated for by 
using a probability distribution with a wider spread than the normal curve. The 
most common such distribution used in practice is Student's t-distribution. The 
t-distribution curve is of a similar form to the normal function. As the number 
of samples selected and analysed increases the two functions become increasingly
more similar. Using the t-distribution the well known t-test can be
performed to establish the likelihood that a given sample is a member of a 
Figure 3 Areas under the normal curve and z values for some critical regions


population with specified characteristics. Replacing the z-statistic by the
t-statistic implies that we must specify not only the level of significance, α, of the
test, but also the so-called number of degrees of freedom, i.e. the number of
independent measures contributing to the set of data. From the data supplied in
Table 1, is it likely that these samples of mineral water came from a source with
a mean sodium concentration of more than 10.5 mg kg⁻¹?

Assuming the samples were randomly collected, then the t-statistic is com- 
puted from 


$$t = \frac{\bar{x} - \mu_0}{s/\sqrt{n}} \qquad (10)$$


where x̄ and s are our calculated estimates of the sample mean and standard
deviation, respectively. From standard tables, for 39 degrees of freedom, n − 1,
and with a 5% level of significance the value of t is given as 1.68. From
Equation (10), t = 4.38 which exceeds the tabulated value of t and thus lies in
the critical region of the t-curve. Our conclusion is that the samples are unlikely
to arise from a source with a mean sodium level of 10.5 mg kg⁻¹ or less, leaving
the alternative hypothesis that the sodium concentration of the parent source is
greater than this.
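
For reference, the arithmetic behind the quoted t-value can be written out using the rounded summary statistics from Table 1 (mean 11.04, s = 0.78, n = 40) and the hypothesized source mean of 10.5 mg kg⁻¹:

$$t = \frac{11.04 - 10.5}{0.78/\sqrt{40}} = \frac{0.54}{0.1233} = 4.38$$

Working from the unrounded standard deviation gives a slightly larger value, but the conclusion is unchanged.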

The t-test can also be employed in comparing statistics from two different 
samples or analytical methods rather than comparing, as above, one sample 
against a parent population. The calculation is only a little more elaborate, 
involving the standard deviation of two data sets to be used. Suppose the results 
from the analysis of a second day's batch of 40 samples of water give a mean 
value of 10.9 mg kg⁻¹ and standard deviation of 0.83 mg kg⁻¹. Are the mean




sodium levels from this set and the data in Table 1 similar, and could the 
samples come from the same parent population? 
For this example the t-test takes the form 


$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{1/n_1 + 1/n_2}} \qquad (11)$$


The quantity $s_p$ is the pooled estimate of the parent population standard
deviation and, for equal numbers of samples in the two sets (n₁ = n₂), is given by


$$s_p^2 = \frac{s_1^2 + s_2^2}{2} \qquad (12)$$


where s₁ and s₂ are the standard deviations for the two sets of data.

Substituting the experimental values in Equations (11) and (12) provides a
t-value of 0.78. Accepting once again a 5% level of significance, the tabulated
value of t for 78 degrees of freedom (n₁ + n₂ − 2) and α = 0.025 is 1.99. (Since the mean of
one set of data could be significantly higher or lower than the other, an α value
of 2.5% is chosen to give a combined 5% critical region, a so-called two-tailed
application.) As the calculated t-value is less than the tabulated value then there
is no evidence to suggest that the samples came from populations having
different means. Hence, we accept that the samples are similar.
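
The two-sample comparison can be sketched as below. The function `pooled_t` is a hypothetical helper written for this illustration; it uses the general pooled variance, weighted by degrees of freedom, which reduces to the simple average of the two variances when the sets are of equal size. The critical value of 1.99 is an assumption taken from standard two-tailed tables for n₁ + n₂ − 2 = 78 degrees of freedom.

```python
import math


def pooled_t(mean1, s1, n1, mean2, s2, n2):
    """Return (t, degrees of freedom) for a two-sample t-test with pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / df
    t = (mean1 - mean2) / math.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))
    return t, df


# Day 1 (Table 1) against the second day's batch described in the text.
t, df = pooled_t(11.04, 0.78, 40, 10.9, 0.83, 40)
t_critical = 1.99  # assumed tabulated two-tailed 5% value for 78 degrees of freedom

print(f"t = {t:.2f} with {df} degrees of freedom")
print("means differ significantly:", abs(t) > t_critical)
```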

The t-test is widely used in analytical laboratories for comparing samples and 
methods of analysis. Its application, however, relies on three basic
assumptions. Firstly, it is assumed that the samples analysed are selected at random.
This condition is met in most cases by careful design of the sampling procedure. 
The second assumption is that the parent populations from which the samples 
are taken are normally distributed. Fortunately, departure from normality 
rarely causes serious problems providing sufficient samples are analysed. 
Finally, the third assumption is that the population variances are equal. If this 
last criterion is not valid then errors may arise in applying the t-test and this 
assumption should be checked before other tests are applied. The equality of 
variances can be examined by application of the F-test. 

The F-test is based on the F-probability distribution curve and is used to test 
the equality of variances obtained by statistical sampling. The distribution 
describes the probabilities of obtaining specified ratios of sample variance from 
the same parent population. Starting with a normal distribution with variance
σ², if two random samples of sizes n₁ and n₂ are taken from this population and
the sample variances, s₁² and s₂², calculated then the quotient s₁²/s₂² will be
close to unity if the sample sizes are large. By taking repeated pairs of samples
and plotting the ratio, F = s₁²/s₂², the F-distribution curve is obtained.?

In comparing sample variances, the ratio s₁²/s₂² for the two sets of data is
computed and the probability assessed, from F-tables, of obtaining by chance 
that specific value of F from two samples arising from a single normal popu- 
lation. If it is unlikely that this ratio could be obtained by chance, then this is 
taken as indicating that the samples arise from different parent populations 
with different variances. 
A simple application of the F-test can be illustrated by examining the mineral
water data in the previous examples for equality of variance. 
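The F-ratio is given by,


$$F = \frac{s_1^2}{s_2^2} \qquad (13)$$


where s₁² is taken as the larger of the two sample variances, which for the
experimental data gives F = 0.83²/0.78² = 1.13.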

Each variance has 39 degrees of freedom (n − 1) associated with it, and from
tables, the F-value at the 5% significance level is approximately 1.80. The
F-value for the experimental data is less than this and, therefore, does not lie in 
the critical region. Hence, the hypothesis that the two samples came from 
populations with similar variances is accepted. 
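
A minimal sketch of the variance-ratio calculation is given below. The function name `f_ratio` is hypothetical, and the critical value is the approximate tabulated figure quoted in the text. Note that the ratio is formed from the variances, i.e. the squared standard deviations, with the larger variance in the numerator.

```python
def f_ratio(s_a, s_b):
    """Ratio of two sample variances, larger over smaller, from standard deviations."""
    var_a, var_b = s_a ** 2, s_b ** 2
    return max(var_a, var_b) / min(var_a, var_b)


# Standard deviations (mg/kg) of the two batches of mineral water results.
f_value = f_ratio(0.78, 0.83)
f_critical = 1.80  # approximate tabulated value quoted in the text

print(f"F = {f_value:.2f}")
print("variances differ significantly:", f_value > f_critical)
```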

In preceding examples we have been comparing distributions of variates
measured in the same units, e.g. mg kg⁻¹. Comparing variates of differing units
can be achieved by transforming our data by the process called standardization 
which results in new values that have a mean of zero and unit standard 
deviation. Standardization is achieved by subtracting a variable's mean value 
from each individual value and dividing by the standard deviation of the 
variable's distribution. Denoting the new standardized value as $z_i$ gives


$$z_i = \frac{x_i - \bar{x}}{s} \qquad (14)$$


Standardization is a common transformation procedure in statistics and 
chemometrics. It should be used with care as it can distort data by masking 
major differences in relative magnitudes between variables. 
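
Standardization is straightforward to sketch in code. The example below applies it to the first group of eight sodium results from Table 1; the function name `standardize` is hypothetical. After the transformation the values have zero mean and unit standard deviation, whatever their original units.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the sample standard deviation."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return [(x - mean) / s for x in values]


# First group of eight sodium results (mg/kg) from Table 1.
first_group = [10.8, 11.1, 10.6, 10.9, 11.5, 11.2, 10.5, 11.8]
z_scores = standardize(first_group)

print([round(z, 2) for z in z_scores])
print(f"mean = {statistics.mean(z_scores):.2f}, s = {statistics.stdev(z_scores):.2f}")
```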


Analysis of Variance 


The tests and examples discussed above have concentrated on the statistics 
associated with a single variable and comparing two samples. When more 
samples are involved a new set of techniques is used, the principal methods 
being concerned with the analysis of variance. Analysis of variance plays a 
major role in statistical data analysis and many texts are devoted to the 
subject.*? Here, we will only discuss the topic briefly and illustrate its use in a 
simple example. 

Consider an agricultural trial site sampled to provide six soil samples which 
are subsequently analysed colorimetrically for phosphate concentration. The 
task is to decide whether the phosphate content is the same in each sample. 


4 H.L. Youmans, ‘Statistics for Chemists’, J. Wiley, New York, USA, 1973.

5 G.E.P. Box, W.G. Hunter, and J.S. Hunter, ‘Statistics for Experimenters', J. Wiley, New York, 
USA, 1978. 

6 D.L. Massart, A. Dijkstra, and L. Kaufman, ‘Evaluation and Optimisation of Laboratory 
Methods and Analytical Procedures', Elsevier, London, UK, 1978. 

7 L. Davies, ‘Efficiency in Research, Development and Production: The Statistical Design and 
Analysis of Chemical Experiments', The Royal Society of Chemistry, Cambridge, UK, 1993. 

8 M.J. Adams, in ‘Practical Guide to Chemometrics’, ed. S.J. Haswell, Marcel Dekker, New York,
USA, 1992, p. 181. 
Table 3 Concentration of phosphate (mg kg⁻¹), determined colorimetrically, in
five sub-samples of soils from six field sites

                    Phosphate (mg kg⁻¹)

Sample            1      2      3      4      5      6
Sub-sample
i                51     49     56     56     48     56
ii               54     56     58     48     51     52
iii              53     51     52     52     57     52
iv               48     49     51     58     55     58
v                47     48     58     51     53     56


Table 4 Commonly used table layout for the analysis of variance (ANOVA) and
calculation of the F-value statistic

Source of          Sum of     Degrees of    Mean
variation          squares    freedom       squares    F-Test

Among samples      SS_A       m − 1         s_A²       s_A²/s_W²
Within samples     SS_W       N − m         s_W²
Total variation    SS_T       N − 1         s_T²


A common problem with this type of data analysis is the need to separate the
within-sample variance, i.e. the variation due to sample inhomogeneity and 
analytical errors, from the variance which exists due to differences between the 
phosphate content in the samples. The experimental procedure is likely to 
proceed by dividing each sample into sub-samples and determining the phos- 
phate concentration of each sub-sample. This process of analytical replication 
serves to provide a means of assessing the within-sample variations due to 
experimental error. If this is observed to be large compared with the variance 
between the samples it will obviously be difficult to detect the differences 
between the six samples. To reduce the chance of introducing a systematic error 
or bias in the analysis, the sub-samples are randomized. In practice, this means 
that the sub-samples from all six samples are analysed in a random order and 
the experimental errors are confounded over all replicates. The analytical data
using this experimental scheme are shown in Table 3. The similarity of the six soil
samples is then assessed by the statistical techniques referred to as one-way 
analysis of variance. Such a statistical analysis of the data is most easily 
performed using an ANOVA (ANalysis Of VAriance) table as illustrated in 
Table 4. 

The total variation in the data can be partitioned between the variation
among the samples and the variation within the samples. The
computation proceeds by determining the sum of squares for each source of variation
and then the variances.
The total variance for all replicates of all samples analysed is given, from
Equation (3), by


$$s_T^2 = \frac{\sum_{j=1}^{m}\sum_{i=1}^{n} (x_{ij} - \bar{x})^2}{N - 1} \qquad (15)$$


where $x_{ij}$ is the ith replicate of the jth sample. The total number of analyses is
denoted by N, which is equal to the number of replicates per sample, n,
multiplied by the number of samples, m. The numerator in Equation (15) is the
sum of squares for the total variation, $SS_T$, and can be rearranged to simplify
calculations,


$$SS_T = \sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}^2 - \frac{\left(\sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}\right)^2}{N} \qquad (16)$$


The variance among the different samples is obtained from $SS_A$,


$$SS_A = \sum_{j=1}^{m} \frac{\left(\sum_{i=1}^{n} x_{ij}\right)^2}{n} - \frac{\left(\sum_{j=1}^{m}\sum_{i=1}^{n} x_{ij}\right)^2}{N} \qquad (17)$$


and the within-sample sum of squares, $SS_W$, can be obtained by difference,

$$SS_W = SS_T - SS_A \qquad (18)$$


For the soil phosphate data, the completed ANOVA table is shown in 
Table 5. 

Once the F-test value has been calculated it can be compared with standard 
tabulated values, using some pre-specified level of significance to check whether 
it lies in the critical region. If it does not, then there is no evidence to suggest 
that the samples arise from different sources and the hypothesis that all the 
values are similar can be accepted. From statistical tables, $F_{0.01,5,24}$ = 3.90, and
since the experimental value of 1.69 does not exceed this then the result is not 
significant at the 1% level and we can accept the hypothesis that there is no 
difference between the six sets of sub-samples. 
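
The one-way analysis of variance for the soil phosphate data can be sketched as follows. The function name `one_way_anova` is hypothetical, the data are transcribed from Table 3, and the code assumes an equal number of replicates in every sample, as is the case here. The sums of squares follow the computational route described above: total and among-sample terms are calculated directly and the within-sample term is obtained by difference.

```python
# Phosphate results (mg/kg) from Table 3: six samples, five sub-samples each.
phosphate = [
    [51, 54, 53, 48, 47],
    [49, 56, 51, 49, 48],
    [56, 58, 52, 51, 58],
    [56, 48, 52, 58, 51],
    [48, 51, 57, 55, 53],
    [56, 52, 52, 58, 56],
]


def one_way_anova(groups):
    """Return (SS_among, SS_within, SS_total, F) for equally sized groups."""
    m = len(groups)            # number of samples
    n = len(groups[0])         # replicates per sample (assumed equal)
    total_count = m * n        # N, total number of analyses
    grand_total = sum(sum(g) for g in groups)
    correction = grand_total ** 2 / total_count
    ss_total = sum(x * x for g in groups for x in g) - correction
    ss_among = sum(sum(g) ** 2 for g in groups) / n - correction
    ss_within = ss_total - ss_among
    ms_among = ss_among / (m - 1)
    ms_within = ss_within / (total_count - m)
    return ss_among, ss_within, ss_total, ms_among / ms_within


ss_among, ss_within, ss_total, f_value = one_way_anova(phosphate)
print(f"SS among = {ss_among:.1f}, SS within = {ss_within:.1f}, "
      f"SS total = {ss_total:.1f}, F = {f_value:.2f}")
```

The values agree with the entries of the completed ANOVA table for these data.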

The simple one-way analysis of variance discussed above can indicate the 
relative magnitude of differences in variance but provides no information as to 


Table 5 Completed ANOVA table for phosphate data from Table 3

Source of          Sum of     Degrees of    Mean
variation          squares    freedom       squares    F-Test

Among samples       92.8         5          18.56      1.69
Within samples     264          24          11
Total variation    356.8        29
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the source of the observed variation. For example, a single step in an experi- 
mental procedure may give rise to a large degree of error in the analysis. This 
would not be identified by ANOVA because it would be mixed with all other 
sources of variance in calculating $SS_W$. More sophisticated and elaborate 
statistical tests are readily available for a detailed analysis of such data and the 
interested reader is referred to the many statistics texts available." The use of 
the F-test and analysis of variance will be encountered frequently in subsequent 
examples and discussions. 


Outliers 


The suspected presence of rogue values or outliers in a data set always causes 
problems for the analyst. Not only must we be able to detect them, but some 
systematic and reliable procedure for reducing their effect or eliminating them 
may need to be implemented. Methods for detecting outliers depend on the 
nature of the data as well as the data analysis being performed. For the present, 
two commonly employed methods will be discussed briefly. 

The first method is Dixon's Q-test.? The data points are ranked and the 
difference between a suspected outlier and the observation closest to it is 
compared to the total range of measurements. This ratio is the Q-value. As with 
the t-test, if the computed Q-value is greater than tabulated critical values for 
some pre-selected level of significance, then the suspect data value can be 
identified as an outlier and may be rejected. 

Use of this test can be illustrated with reference to the data in Table 6, which 
shows ten replicate measures of the molar absorptivity of nitrobenzene at 
252 nm, its wavelength of maximum absorbance. Can the value of 
ε = 1056 mol⁻¹ m² be classed as an outlier? As defined above, 

$$ Q = |1056 - 1012| / |1056 - 990| = 0.67 \qquad (19) $$


For a sample size of 10, and with a 5% level of significance, the critical value 
of Q, from tables, is 0.464. The calculated Q-value exceeds this critical value, 
and therefore this point may be rejected from subsequent analysis. If necessary, 
the remaining data can be examined for further suspected outliers. 
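
The arithmetic of Equation (19) can be written as a small helper. This is only a sketch: the function name `q_value` and its argument names are illustrative, the three absorptivities are those printed in Equation (19), and the critical value is the tabulated figure quoted in the text.

```python
def q_value(suspect, nearest, reference):
    """Gap to the nearest neighbour divided by the spread, as in Equation (19)."""
    return abs(suspect - nearest) / abs(suspect - reference)


q = q_value(suspect=1056, nearest=1012, reference=990)
q_critical = 0.464             # n = 10, 5% level, as quoted in the text

print(round(q, 2))             # 0.67
print(q > q_critical)          # True: the point may be rejected
```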

9 S.J. Haswell, in ‘Practical Guide to Chemometrics’, ed. S.J. Haswell, Marcel Dekker, New York, 
USA, 1992, p. 5. 
10 R.G. Brereton, ‘Chemometrics’, Ellis Horwood, Chichester, UK, 1990. 


Table 6 Molar absorptivity values for nitrobenzene measured at 252 nm 

ε (mol⁻¹ m² at 252 nm) 

1010    990    978    996   1005 
1002   1056   1012    997   1004 


A second method involves the examination of residuals. A residual is 
defined as the difference between an observed value and some expected, 
predicted or modelled value. If the suspect datum has a residual greater than, 
say, 4 times the residual standard deviation computed from all data, then it may 
be rejected. For the data in Table 6, the expected value is the mean of the ten 
results and the residuals are the differences between each value and this mean. 
The standard deviation of these residuals is 14.00 and the residual for the 
suspected outlier, 49, is certainly less than 4 times this value and, hence, this 
point should not be rejected. If a 3σ criterion is employed then this datum is 
rejected. 
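
The residual criterion amounts to a single comparison, sketched below. The function name `exceeds_criterion` is hypothetical, and the residual and standard deviation are the figures quoted in the text rather than values recomputed here.

```python
def exceeds_criterion(residual, residual_sd, k):
    """True if |residual| is larger than k residual standard deviations."""
    return abs(residual) > k * residual_sd


residual, residual_sd = 49, 14.00       # figures quoted in the text

print(exceeds_criterion(residual, residual_sd, k=4))   # False: 49 < 56, retain
print(exceeds_criterion(residual, residual_sd, k=3))   # True: 49 > 42, reject
```

The outcome depends entirely on the multiplier chosen, which is one reason the test is best treated as diagnostic.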

If an outlier is rejected from a set of data then its value can be completely 
removed and the result discarded. Alternatively, the value can be replaced with 
an average value computed from all acceptable results or replaced by the next 
largest, or smallest, measure as appropriate. 

Before leaving this brief discussion of outlier detection and treatment, a 
cautionary warning is appropriate. Testing for outliers should be strictly 
diagnostic, i.e. a means of checking that assumptions regarding the data 
distribution or some selected model are reasonable. Great care should be taken 
before rejecting any data; indeed there is a strong case for stating that no data 
should be rejected. If an outlier does exist, it may be more important to attempt 
to determine and address its cause, whether this be experimental error or some 
failure of the underlying model, rather than simply to reject it from the data. 
Outlier detection and treatment is of major concern to analysts, particularly 
with multivariate data where the presence of outliers may not be immediately 
obvious from visual inspection of tabulated data. Whatever mathematical 
treatment of outliers is adopted, visual inspection of graphical displays of the 
data prior to and during analysis still remains one of the most effective means of 
identifying suspect data. 


3 Lorentzian Distribution 


Our discussions so far have been limited to assuming a normal, Gaussian 
distribution to describe the spread of observed data. Before proceeding to 
extend this analysis to multivariate measurements, it is worthwhile pointing out 
that other continuous distributions are important in spectroscopy. One distri- 
bution which is similar, but unrelated, to the Gaussian function is the Lorentz- 
ian distribution. Sometimes called the Cauchy function, the Lorentzian distri- 
bution is appropriate when describing resonance behaviour, and it is commonly 
encountered in emission and absorption spectroscopies. This distribution for a 
single variable, x, is defined by 


$$ f(x) = \frac{1}{\pi} \cdot \frac{\omega_{1/2}/2}{(x - \mu)^2 + (\omega_{1/2}/2)^2} \qquad (20) $$


Figure 4 Comparison of Lorentzian and Gaussian (normal) distributions 


Like the normal distribution, the Lorentzian distribution is a continuous 
function, symmetric about its mean, μ, with a spread characterized by the 
half-width, $\omega_{1/2}$. The standard deviation is not defined for the Lorentzian 
distribution because of its slowly decreasing behaviour at large deviations from 
the mean. Instead, the spread is denoted by $\omega_{1/2}$, defined as the full-width at half 
maximum height. Figure 4 illustrates the comparison between the normal and 
Lorentzian shapes.” We shall meet the Lorentzian function again in subsequent 
chapters. 
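
A short numerical check of Equation (20) may help to fix the meaning of the width parameter. The sketch below assumes the form of the Lorentzian given above; the function names and the chosen width are illustrative, and the Gaussian is given the same full width at half maximum purely for comparison of the tails.

```python
import math


def lorentzian(x, mu, fwhm):
    """Lorentzian (Cauchy) density; fwhm is the full width at half maximum."""
    half = fwhm / 2.0
    return (1.0 / math.pi) * half / ((x - mu) ** 2 + half ** 2)


def gaussian(x, mu, sigma):
    """Normal density, for comparison."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


mu, fwhm = 0.0, 2.0
peak = lorentzian(mu, mu, fwhm)
ratio = lorentzian(mu + fwhm / 2.0, mu, fwhm) / peak
print(ratio)    # 0.5: half the peak height at mu + fwhm/2

# A Gaussian with the same full width has sigma = fwhm / (2 sqrt(2 ln 2)).
sigma = fwhm / (2.0 * math.sqrt(2.0 * math.log(2.0)))
for x in (2.0, 5.0, 10.0):
    print(x, lorentzian(x, mu, fwhm), gaussian(x, mu, sigma))
```

Far from the mean the Lorentzian values remain appreciable while the Gaussian values become negligible, which is the slowly decreasing behaviour referred to above.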


4 Multivariate Data 


To this point, the data analysis procedures discussed have been concerned with 
a single measured variable. Although the determination of a single analyte 
constitutes an important part of analytical science, there is increasing emphasis 
being placed on multi-component analysis and using multiple measures in data 
analysis. The problems associated with manipulating and investigating multiple 
measurements on one or many samples constitute that branch of applied 
statistics known as multivariate analysis, and this forms a major subject in 
chemometrics.¹¹⁻¹³ 

Consideration of the results from a simple multi-element analysis will serve 
to illustrate terms and parameters associated with the techniques used. This 
example will also introduce some features of matrix operators basic to handling 
multivariate data.¹⁴ In the scientific literature, matrix representation of 
multivariate statistics is common. For those readers unfamiliar with the basic matrix 
operations, or those who wish to refresh their memory, the Appendix provides a 
summary and overview of elementary and common matrix operations. 


11 B.F.J. Manly, ‘Multivariate Statistical Analysis: A Primer’, Chapman and Hall, London, UK, 
1991. 
12 A.A. Afifi and V. Clark, ‘Computer Aided Multivariate Analysis’, Lifetime Learning, 
California, USA, 1984. 
13 B. Flury and H. Riedwyl, ‘Multivariate Statistics, A Practical Approach’, Chapman and Hall, 
London, UK, 1988. 
14 M.J.R. Healy, ‘Matrices for Statistics’, Oxford University Press, Oxford, UK, 1986. 


Table 7 Results from the analysis of mineral water samples by atomic absorption 
spectrometry. Expressed as a data matrix, each column represents a variate and 
each row a sample or object 

                 Variables (mg kg⁻¹)
Samples      Sodium   Potassium   Calcium   Magnesium
1             10.8       1.6        41.3        7.2
2              7.1       1.1        72.0        8.0
3             14.1       2.0        92.0        8.2
4             17.0       3.1       117.0       18.0
5              5.7       0.4        47.5       16.5
6             11.3       1.8        62.2       14.6
Mean          11.0       1.7        72.0       12.1
Variance      17.8       0.8       812.8       23.3

The data shown in Table 7 comprise a portion of a multi-element analysis of 
mineral water samples. The data from such an analysis can conveniently be 
arranged in an n by m array, where n is the number of objects, or samples, and m 
is the number of variables measured. This array is referred to as the data matrix 
and the purpose of using matrix notation is to allow us to handle arrays of data 
as single entities rather than having to specify each element in the array every 
time we perform an operation on the data set. Our data matrix can be denoted 
by the single symbol X and each element by $x_{ij}$, with the subscripts i and j 
indicating the number of the row and column respectively. A matrix with only 
one row is termed a row vector, e.g., r, and with only one column, a column 
vector, e.g., c. 
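
As a minimal sketch, the data matrix of Table 7 can be held as a list of rows, from which row vectors, column vectors, and the column statistics of the table follow directly. Plain Python lists stand in for a matrix library here, and the variable names are illustrative only.

```python
from statistics import mean, variance

# Table 7 as a data matrix: rows are samples, columns are Na, K, Ca, Mg (mg/kg).
X = [
    [10.8, 1.6, 41.3, 7.2],
    [7.1, 1.1, 72.0, 8.0],
    [14.1, 2.0, 92.0, 8.2],
    [17.0, 3.1, 117.0, 18.0],
    [5.7, 0.4, 47.5, 16.5],
    [11.3, 1.8, 62.2, 14.6],
]
names = ["Sodium", "Potassium", "Calcium", "Magnesium"]

n, m = len(X), len(X[0])                  # n objects by m variables
row_vector = X[0]                         # all variables for sample 1
column_vector = [row[0] for row in X]     # sodium for every sample

columns = list(zip(*X))                   # one tuple per variate
col_means = [mean(c) for c in columns]
col_vars = [variance(c) for c in columns] # sample variance, divisor n - 1

for name, avg, var in zip(names, col_means, col_vars):
    print(f"{name:10s} mean = {avg:5.1f}  variance = {var:6.1f}")
```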

Each measure of an analysed variable, or variate, may be considered 
independent. By summing elements of each column vector the mean and 
standard deviation for each variate can be calculated (Table 7). Although these 
operations reduce the size of the data set to a smaller set of descriptive statistics, 
much relevant information can be lost. When performing any multivariate data 
analysis it is important that the variates are not considered in isolation but are 
combined to provide as complete a description of the total system as possible. 
Interaction between variables can be as important as the individual mean values 
and the distributions of the individual variates. Variables which exhibit no 
interaction are said to be statistically independent, as a change in the value in 
one variable cannot be predicted by a change in another measured variable. In 
many cases in analytical science the variates are not statistically independent, 
and some measure of their interaction is required in order to interpret the data 
and characterize the samples. The degree or extent of this interaction between 
variables can be estimated by calculating their covariances, the subject of the 
next section. 


Covariance and Correlation 


Just as variance describes the spread of data about its mean value for a single 
variable, so the distribution of multivariate data can be assessed from the 
covariance. The procedure employed for the calculation of variance can be 
extended to multivariate analysis by computing the extent of the mutual 
variability of the variates about some common mean. The measure of this 
interaction is the covariance. 

Equation (3), defining variance, can be written in the form, 

$$ s^2 = \sum x_d^2 / (n - 1) \qquad (21) $$

where $x_d = x_i - \bar{x}$, or, in matrix notation, 

$$ s^2 = \mathbf{x}_d^T \mathbf{x}_d / (n - 1) \qquad (22) $$

with $\mathbf{x}^T$ denoting the transpose of the column vector $\mathbf{x}$ to form a row vector 
(see Appendix). The numerator in Equations (21) and (22) is the corrected sum 
of squares of the data (corrected by subtracting the mean value and referred to 
as mean centring). To calculate covariance, the analogous quantity is the 
corrected sum of products, SP, which is defined by 

$$ SP_{jk} = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k) \qquad (23) $$

where $x_{ij}$ is the ith measure of variable j, i.e. the value of variable j for object i, 
$x_{ik}$ is the ith measure of variable k, and $SP_{jk}$ is the corrected sum of products 
between variables j and k. Note that in the special case where j = k Equation 
(23) gives the sum of squares as used in Equation (3). 

Sums of squares and products are basic to many statistical techniques and 
Equation (23) can be simply expressed, using the matrix form, as 

$$ \mathbf{SP} = \mathbf{X}_d^T \cdot \mathbf{X}_d \qquad (24) $$

where $\mathbf{X}_d$ represents the data matrix after subtracting the column, i.e. variate, 
means. The calculation of variance is completed by dividing by (n − 1) and 
covariance is similarly obtained by dividing each element of the matrix SP by 
(n − 1). 

The steps involved in the algebraic calculation of the covariance between 
sodium and potassium concentrations from Table 7 are shown in Table 8. The 
complete variance-covariance matrix for our data is given in Table 9. 
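
The matrix route of Equation (24) can be sketched as follows: mean-centre each column of the Table 7 data, form the sums of squares and products, and divide by (n − 1). Plain Python lists are used so that every step is visible; the variable names are illustrative, and a matrix library would normally perform the product in a single call.

```python
# Table 7: rows are samples, columns are Na, K, Ca, Mg (mg/kg).
X = [
    [10.8, 1.6, 41.3, 7.2],
    [7.1, 1.1, 72.0, 8.0],
    [14.1, 2.0, 92.0, 8.2],
    [17.0, 3.1, 117.0, 18.0],
    [5.7, 0.4, 47.5, 16.5],
    [11.3, 1.8, 62.2, 14.6],
]
n, m = len(X), len(X[0])

# Column (variate) means, then the mean-centred matrix X_d.
means = [sum(row[j] for row in X) / n for j in range(m)]
Xd = [[row[j] - means[j] for j in range(m)] for row in X]

# SP = X_d^T X_d: element (j, k) is the corrected sum of products of Eq. (23).
SP = [[sum(Xd[i][j] * Xd[i][k] for i in range(n)) for k in range(m)]
      for j in range(m)]

# Dividing every element by (n - 1) gives the variance-covariance matrix.
COV = [[SP[j][k] / (n - 1) for k in range(m)] for j in range(m)]

for row in COV:
    print("  ".join(f"{v:7.2f}" for v in row))
```

The diagonal of `COV` holds the variances of Table 7 and the off-diagonal elements are the covariances.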

Table 8 Calculation of covariance between sodium and potassium concentrations 

[Na]     [K]
x₁       x₂       x₁ − x̄₁    x₂ − x̄₂    (x₁ − x̄₁)(x₂ − x̄₂)
10.8     1.6       −0.2       −0.07        0.014
 7.1     1.1       −3.9       −0.57        2.223
14.1     2.0        3.1        0.33        1.023
17.0     3.1        6.0        1.43        8.580
 5.7     0.4       −5.3       −1.27        6.731
11.3     1.8        0.3        0.13        0.039

Mean  = 11.0     1.67                  Sum = 18.610
Sum   = 66.0    10.0
s²    = 17.81    0.82

$SP_{Na,K}$ = 128.61 − [(66.0)·(10.0)]/6 = 18.61 
$COV_{Na,K}$ = 18.61/5 = 3.72 


Table 9 Symmetric variance-covariance matrix for the analytes in Table 7. The 
diagonal elements are the variances of individual variates; off-diagonal 
elements are covariances between variates 

             Sodium   Potassium   Calcium   Magnesium
Sodium        17.81      3.72      93.01       3.54
Potassium      3.72      0.82      20.59       0.91
Calcium       93.01     20.59     812.76      41.13
Magnesium      3.54      0.91      41.13      23.29


For the data the variance-covariance matrix, $\mathbf{COV}_x$, is square, the number of 
rows and number of columns are the same, and the matrix is symmetric. For a 
symmetric matrix, $x_{ij} = x_{ji}$, and some pairs of entries are duplicated. The 
covariance between, say, sodium and potassium is identical to that between 
potassium and sodium. The variance-covariance matrix is said to have diagonal 
symmetry with the diagonal elements being the variances of the individual 
variates. 

In Figure 5(a) a scatter plot of the concentration of sodium vs. the concen- 
tration of potassium, from Table 7, is illustrated. It can be clearly seen that the 
two variates have a high interdependence, compared with magnesium vs. 
potassium concentration, Figure 5(b). Just as the absolute value of variance is 
influenced by the units of measurement, so covariance is similarly affected. To 
estimate the degree of interrelation between variables, free from the effects of 
measurement units, the correlation coefficient can be employed. The linear 
correlation coefficient, $r_{jk}$, between two variables j and k is defined by, 

$$ r_{jk} = \mathrm{Covariance}_{jk} / (s_j s_k) \qquad (25) $$

As the value for covariance can equal but never exceed the product of the 
standard deviations, values for r range from −1 to +1. The complete 
correlation matrix for the elemental data is presented in Table 10. 


Figure 5 There is a higher correlation, dependence, between the concentrations of sodium 
and potassium than between magnesium and potassium. Data from Table 7 
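
Equation (25) can be applied to every pair of variates in Table 7 to build the correlation matrix. The sketch below uses plain Python with illustrative names (`cov`, `sd`, `R`); it is one straightforward way of carrying out the calculation, not a prescribed one.

```python
import math

# Table 7: rows are samples, columns are Na, K, Ca, Mg (mg/kg).
X = [
    [10.8, 1.6, 41.3, 7.2],
    [7.1, 1.1, 72.0, 8.0],
    [14.1, 2.0, 92.0, 8.2],
    [17.0, 3.1, 117.0, 18.0],
    [5.7, 0.4, 47.5, 16.5],
    [11.3, 1.8, 62.2, 14.6],
]
n, m = len(X), len(X[0])
means = [sum(row[j] for row in X) / n for j in range(m)]


def cov(j, k):
    """Covariance between variates j and k (the variance when j == k)."""
    return sum((row[j] - means[j]) * (row[k] - means[k]) for row in X) / (n - 1)


sd = [math.sqrt(cov(j, j)) for j in range(m)]          # standard deviations
R = [[cov(j, k) / (sd[j] * sd[k]) for k in range(m)]   # Equation (25)
     for j in range(m)]

for row in R:
    print("  ".join(f"{r:5.2f}" for r in row))
```

Because each covariance is divided by the corresponding standard deviations, the result no longer depends on the units of measurement.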


Figure 6 illustrates a series of scatter plots between variates having 
correlation coefficients between the two possible extremes. A correlation coefficient 
close to +1 indicates a high positive interdependence between variates, 
whereas a negative value means that the value of one variable decreases as the 
other increases, i.e. a strong negative interdependence. A value of r near zero 
indicates that the variables are linearly independent. 

Correlation as a measure of similarity and association between variables is 
often used in many aspects of chemometrics. Used with care, it can assist in 
selecting variables for data analysis as well as providing a figure of merit as to 
how good a mathematical model fits experimental data, e.g. in constructing 
calibration curves. Returning to the extreme data set of Table 2, the correlation 
coefficient between chromium and nickel concentrations is identical for each 
source of water. If the data are plotted, however, some of the dangers of 
quoting r values are evident. From Figure 7, it is reasonable to propose a linear 
relationship between the concentrations of chromium and nickel for samples 
from A. This is certainly not the case for samples B, and the graph suggests that 
a higher order, possibly quadratic, model would be better. For samples from 
source C, a potential outlier has reduced an otherwise excellent linear 
correlation, whereas for source D there is no evidence of any relationship between 
chromium and nickel but an outlier has given rise to a high correlation 
coefficient. To repeat the earlier warning, always visually examine the data 
before proceeding with any manipulation. 


Table 10 Correlation matrix for the analytes in Table 7. The matrix is 
symmetric about the diagonal and values lie in the range −1 to +1 

             Sodium   Potassium   Calcium   Magnesium
Sodium        1.00      0.97       0.77       0.17
Potassium     0.97      1.00       0.80       0.21
Calcium       0.77      0.80       1.00       0.30
Magnesium     0.17      0.21       0.30       1.00


Figure 6 Scatter plots for bivariate data with various values of correlation coefficient, r. 
Least-squares best-fit lines are also shown. Note that correlation is only a 
measure of linear dependence between variates 


Figure 7 Scatter plots of the concentrations of chromium vs. nickel from four waste water 
sources, from Table 2 


Multivariate Normal 


In much the same way as the more common univariate statistics assume a 
normal distribution of the variable under study, so the most widely used 
multivariate models are based on the assumption of a multivariate normal 
distribution for each population sampled. The multivariate normal distribution 
is a generalization of its univariate counterpart and its equation in matrix 
notation is 

$$ f(\mathbf{x}) = \frac{1}{(2\pi)^{m/2} |\mathbf{COV}_x|^{1/2}} \exp\left[ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \mathbf{COV}_x^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right] \qquad (26) $$


The representation of this equation for anything greater than two variates is 
difficult to visualize, but the bivariate form (m = 2) serves to illustrate the 
general case. The exponential term in Equation (26) is of the form $\mathbf{x}^T \mathbf{A} \mathbf{x}$ and is 
known as a quadratic form of a matrix product (Appendix A). Although the 
mathematical details associated with the quadratic form are not important for 
us here, one important property is that it has a well known geometric 
interpretation. All quadratic forms that occur in chemometrics and statistical 
data analysis expand to produce a quadratic surface that is a closed ellipse. Just 
as the univariate normal distribution appears bell-shaped, so the bivariate 
normal distribution is elliptical. 

For two variates, $x_1$ and $x_2$, the mean vector and variance-covariance matrix 
are defined in the manner discussed above, 

$$ \boldsymbol{\mu} = \begin{bmatrix} \mu_1 & \mu_2 \end{bmatrix}, \qquad \mathbf{COV}_x = \begin{bmatrix} \sigma_{11}^2 & \sigma_{12}^2 \\ \sigma_{21}^2 & \sigma_{22}^2 \end{bmatrix} \qquad (27) $$

where $\mu_1$ and $\mu_2$ are the means of $x_1$ and $x_2$ respectively, $\sigma_{11}^2$ and $\sigma_{22}^2$ are their 
variances, and $\sigma_{12}^2 = \sigma_{21}^2$ is the covariance between $x_1$ and $x_2$. Figure 8 
illustrates some bivariate normal distributions, and the contour plots show the 
lines of equal probability about the bivariate mean, i.e. lines that connect points 
having equal probability of occurring. The contour diagrams of Figure 8 may 
be compared to the correlation plots presented previously. As the covariance, 
$\sigma_{12}^2$, increases in a positive manner from zero, so the association between the 
variates increases and the spread is stretched, because the variables serve to act 
together. If the covariance is negative then the distribution moves in the other 
direction. 
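
For the bivariate case the density of Equation (26) can be evaluated directly, since the 2 × 2 matrix is easily inverted by hand. The following is a sketch under that assumption (m = 2 only); the function name `bivariate_normal_pdf` and the covariance values are illustrative.

```python
import math


def bivariate_normal_pdf(x, mu, cov):
    """Equation (26) for m = 2; cov is [[var1, cov12], [cov12, var2]]."""
    (a, b), (c, d) = cov
    det = a * d - b * c                                   # |COV_x|
    inv = [[d / det, -b / det], [-c / det, a / det]]      # inverse of COV_x
    dx = [x[0] - mu[0], x[1] - mu[1]]
    # Quadratic form (x - mu)^T COV_x^{-1} (x - mu)
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


mu = [0.0, 0.0]
independent = [[1.0, 0.0], [0.0, 1.0]]      # covariance = 0
positive = [[1.0, 0.8], [0.8, 1.0]]         # covariance > 0

# With zero covariance the contours are circles: both points are equally likely.
print(bivariate_normal_pdf([1.0, 1.0], mu, independent),
      bivariate_normal_pdf([1.0, -1.0], mu, independent))

# With positive covariance the contours stretch along the line x1 = x2.
print(bivariate_normal_pdf([1.0, 1.0], mu, positive),
      bivariate_normal_pdf([1.0, -1.0], mu, positive))
```

Points of equal density returned by this function trace out the elliptical contours discussed above.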


Figure 8 Bivariate normal distributions as probability contour plots for data having 
different covariance relationships (panels labelled Covariance = 0 and Covariance < 0) 


5 Displaying Data 


As our discussions of population distributions and basic statistics have 
progressed, the use of graphical methods to display data can be seen to play an 
important role in both univariate and multivariate analysis. Suitable data plots 
can be used to display and describe both raw data, i.e. original measures, and 
transformed or manipulated data. Graphs can aid in data analysis and 
interpretation, and can serve to summarize final results.¹⁵ The use of diagrams may 
help to reveal patterns in the data which may not be obvious from tabulated 
results. With most computer-based data analysis packages the graphics support 
can provide a valuable interface between the user and the experimental data. 

15 J.M. Thompson, in ‘Methods for Environmental Data Analysis’, ed. C.N. Hewitt, Elsevier 
Applied Science, London, UK, 1992, p. 213. 

The construction and use of graphical techniques to display univariate and 
bivariate data are well known. The common calibration graph or analytical 
working curve, relating, for example, measured absorbance to sample con- 
centration, is ubiquitous in analytical science. No spectroscopist would 
welcome the sole use of tabulated spectra without some graphical display of the 
spectral pattern. The display of data obtained from more than two variables, 
however, is less common and a number of ingenious techniques and methods 
have been proposed and utilized to aid in the visualization of such multivariate 
data sets. With three variables a three-dimensional model of the data can be 
constructed and several graphical computer packages are available to assist in 
the design of three-dimensional plots.'* In practice, the number of variables 
examined may well be in excess of two or three and less familiar and less direct 
techniques are required to display the data. Such techniques are generally 
referred to as mapping methods as they attempt to represent a 
many-dimensional data set in a reduced, usually two-dimensional space whilst 
retaining the structure and as much information from the original data as possible. 
For bivariate data the simple scatter plot of variate x against variate y is 
popular and there are several ways in which this can be extended to 
accommodate further variables. Figure 9 illustrates an example of a three-dimensional 
scatter plot. The data used are from Table 11, representing the results of the 
analysis of nine alloys for four elements. The concentrations of three analytes, 
zinc, tin, and iron, are displayed. It is immediately apparent from the 
illustration that the samples fall into one of two groups, with one sample lying between 
the groups. This pattern in the data is more readily seen in the graphical display 
than from the tabulated data. 


Table 11 XRF results from copper-based alloys 

                Variables (% by weight)
Samples     Tin     Zinc    Iron    Nickel
1           0.20    3.40    0.06    0.08
2           0.20    2.40    0.04    0.06
3           0.15    2.00    0.08    0.16
4           0.61    6.00    0.09    0.02
5           0.57    4.20    0.08    0.06
6           0.58    4.82    0.07    0.02
7           0.30    5.60    0.02    0.01
8           0.60    6.60    0.07    0.06
9           0.10    1.60    0.05    0.19


Figure 10 Data from Table 11 displayed as Chernoff faces 

This style of representation is limited to three variables and even then the 
diagrams can become confusing, particularly if there are a lot of points to plot. 
One method for graphically representing multivariate data ascribes each vari- 
able to some characteristic of a cartoon face. These Chernoff faces have been 
used extensively in the social sciences and adaptations have appeared in the 
analytical chemistry literature. Figure 10 illustrates the use of Chernoff faces to 
represent the data from Table 11. The size of the forehead is proportional to tin 
concentration, the lower face to zinc level, eyebrows to nickel, and mouth shape 
to iron concentration. As with the three-dimensional scatter plot, two groups 
can be seen, samples 1, 2, 3, and 9, and samples 4, 5, 6, and 8, with sample 7 
displaying characteristics from both groups. 

Star-plots present an alternative means of displaying the same data (Figure 
11), with each ray size proportional to individual analyte concentrations. 

A serious drawback with multidimensional representation is that visually 
some characteristics are perceived as being of greater importance than others 
and it is necessary to consider carefully the assignment of the variable to the 
graph structure. In scatter plots, the relationships between the horizontal 
co-ordinates can be more obvious than those for the higher-dimensional data 
on a vertical axis. It is usually the case, therefore, that as well as any strictly 
analytical reason for reducing the dimensionality of data, such simplification 
can aid in presenting multidimensional data sets. Thus, principal components 
and principal co-ordinates analysis are frequently encountered as graphical 
aids as well as for their importance in numerically extracting information from 
data. It is important to realize, however, that reduction of dimensionality can 
lead to loss of information. Two-dimensional representation of multivariate 
data can hide structure as well as aid in the identification of patterns. 


Figure 11 Star plots of data from Table 11 


The wide variety of commercial computer software available to the analyst 
for statistical analysis of data has contributed significantly to the increasing use 
and popularity of multivariate analysis. It still remains essential, however, that 
the chemist appreciates the underlying theory and assumptions associated with 
the tests performed. In this chapter, only a brief introduction to the funda- 
mental statistics has been presented. The remainder of the book is devoted to 
the acquisition, manipulation, and interpretation of spectrochemical data. No 
attempt has been made to present computer algorithms or program listings. 
Many fine texts are available that include details and listings of programs for 
numerical and statistical analysis for the interested reader.¹⁶⁻¹⁹ 


16 A.F. Carley and P.H. Morgan, ‘Computational Methods in the Chemical Sciences’, Ellis 
Horwood, Chichester, UK, 1989. 
17 W.H. Press, B.P. Flannery, S.A. Teukolsky, and W.T. Vetterling, ‘Numerical Recipes’, 
Cambridge University Press, Cambridge, UK, 1987. 
18 J. Zupan, ‘Algorithms for Chemists’, Wiley, New York, USA, 1989. 
19 J.C. Davis, ‘Statistics and Data Analysis in Geology’, J. Wiley and Sons, New York, USA, 1973. 


CHAPTER 2 


Acquisition and Enhancement 
of Data 


1 Introduction 


In the modern spectrochemical laboratory, even the most basic of instruments 
is likely to be microprocessor controlled, with the signal output digitized. Given 
this situation, it is necessary for analysts to appreciate the basic concepts 
associated with computerized data acquisition and signal conversion to the 
digital domain. After all, digitization of the analytical signal may represent one 
of the first stages in the data acquisition and manipulation process. If this is 
incorrectly carried out then subsequent processing may not be worthwhile. The 
situation is analogous to that of analytical sampling. If a sample is not 
representative of the parent material, then no matter how good the chemistry or 
the analysis, the results may be meaningless or misleading. 

The detectors and sensors commonly used in spectrometers are analogue 
devices; the signal output represents some physical parameter, e.g. light inten- 
sity, as a continuous function of time. In order to process such data in the 
computer, the continuous, or analogue, signal must be digitized to provide a 
series of numeric values equivalent to and representative of the original signal. 
An important parameter to be selected is how fast, or at what rate, the input 
signal should be digitized. One answer to the problem of selecting an appro- 
priate sampling rate would be to digitize the signal at as high a rate as possible. 
With modern high-speed, analogue-to-digital converters, however, this would 
produce so much data that the storage capacity of the computer would soon be 
exceeded. Instead, it is preferred that the number of values recorded is limited. 
The analogue signal is digitally and discretely sampled, and the rate of sampling 
determines the accuracy of the digital representation as a time discrete function. 


2 Sampling Theory 


Figure 1 illustrates a data path in a typical ratio-recording, dispersive infrared 
spectrometer. The digitization of the analogue signal produced by the detector 
is a critical step in the generation of the analytical spectrum. Sampling theory 
dictates that a continuous time signal can be completely recovered from its 
digital representation if the original analogue signal is band-limited, and if the 
sampling frequency employed for digitization is at least twice the highest 
frequency present in the analogue signal. This often quoted statement is 
fundamental to digitization and is worth examining in more detail. 


Figure 1 Data path of a ratio-recording, dispersive IR spectrometer 
(Reproduced by permission from ref. 1) 


1 M.A. Ford, in ‘Computer Methods in UV, Visible and IR Spectroscopy’, ed. W.O. George and 
H.A. Willis, The Royal Society of Chemistry, Cambridge, UK, 1990, p. 1. 

The process of digital sampling can be represented by the scheme shown in 
Figure 2.² The continuous analytical signal as a function of time, $x_t$, is 
multiplied by a modulating signal comprising a train of pulses of equal magnitude 
and constant period, $p_t$. The resultant signal is a train of similar impulses but 
now with amplitudes limited by the spectral envelope $x_t$. We wish the digital 
representation accurately to reflect the original analogue signal in terms of all 
the frequencies present in the original data. Therefore, it is best if the signals are 
represented in the frequency domain (Figure 3). This is achieved by taking the 
Fourier transform of the spectrum. 

2 A.V. Oppenheim and A.S. Willsky, ‘Signals and Systems’, Prentice-Hall, New Jersey, USA, 1983. 


Figure 2 A schematic of the digital sampling process: (a) A signal, $x_t$, is multiplied by a 
train of pulses, $p_t$, producing the sampled signal; (b) The analytical signal, $x_t$; (c) The 
carrier signal, $p_t$; (d) The resultant sampled signal is a train of pulses with 
amplitudes limited by $x_t$ 
(Reproduced by permission from ref. 2) 


Figure 3 illustrates the Fourier transform, $x_f$, of the analytical signal $x_t$.² At 
frequencies greater than some value, $f_m$, $x_f$ is zero and the signal is said to be 
band-limited. Figure 3(b) shows the frequency spectrum of the modulating 
pulse train. The sampled signal, Figure 3(c), is repetitive with a frequency 
determined by the sampling frequency of the modulating impulses, $f_s$. These 
modulating impulses have a period, t, given by $t = 1/f_s$. It is evident from 
Figure 3(c) that the sampling rate as dictated by the modulating signal, $f_s$, must 
be greater than the maximum frequency present in the spectrum, $f_m$. Not only 
that, it is necessary that the difference, $(f_s - f_m)$, must be greater than $f_m$, i.e. 

$$ (f_s - f_m) > f_m \quad \text{or} \quad f_s > 2 f_m \qquad (1) $$

$f_s = 2 f_m$ is referred to as the minimum or Nyquist sampling frequency. If the 
sampling frequency, $f_s$, is less than the Nyquist value then aliasing arises. This 
effect is illustrated in Figure 3(d). At low sampling frequencies the spectral 
pattern is distorted by overlapping frequencies in the analytical data. 
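
Aliasing is easily demonstrated numerically. In the sketch below, which uses illustrative frequencies and hypothetical function names, a 9 Hz cosine sampled at 10 Hz violates the condition of Equation (1) and yields exactly the same sample values as a 1 Hz cosine.

```python
import math


def satisfies_nyquist(f_s, f_m):
    """Equation (1): the sampling frequency must exceed twice the highest frequency."""
    return f_s > 2.0 * f_m


def sample_cosine(freq, f_s, n_points):
    """Sample cos(2*pi*freq*t) at intervals t = 1/f_s."""
    return [math.cos(2.0 * math.pi * freq * k / f_s) for k in range(n_points)]


f_s = 10.0                                  # sampling frequency in Hz (illustrative)
fast = sample_cosine(9.0, f_s, 20)          # 9 Hz signal: under-sampled
slow = sample_cosine(1.0, f_s, 20)          # 1 Hz signal: adequately sampled

print(satisfies_nyquist(f_s, 9.0))          # False
print(satisfies_nyquist(f_s, 1.0))          # True

# The two sets of samples agree, so the 9 Hz component appears at f_s - 9 = 1 Hz.
largest_difference = max(abs(a - b) for a, b in zip(fast, slow))
print(largest_difference < 1e-9)            # True
```

Once the signal has been digitized the two frequencies cannot be separated, which is why any filtering to limit the bandwidth has to be applied to the analogue signal before sampling.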

Figure 3 Sampling in the frequency domain; (a) Modulated signal, $x_t$, has frequency 
spectrum $x_f$; (b) Harmonics of the carrier signal; (c) Spectrum of modulated 
signal is a repetitive pattern of $x_f$, and $x_t$ can be completely recovered by low pass 
filtering using, for example, a box filter with cut-off frequency $f_c$; (d) Too low a 
sampling frequency produces aliasing, overlapping of frequency patterns 
(Reproduced by permission from ref. 2) 


In practice, analytical signals are likely to contain a large number of very 
high-frequency components and, as pointed out above, it is impractical simply 
to go on increasing the digitizing rate. The situation may be relieved by 
applying a low pass filter to the raw analogue signal to remove high-frequency 
components and, hence, derive a band-limited analogue signal for subsequent 
digital sampling. Provided that the high-frequency analogue information lost 
by this filtering is due only to noise, then the procedure is analytically valid. In 
the schematic diagram of Figure 1, this preprocessing function is undertaken by 
the integration stage prior to digitization. 

Having digitized the analogue signal and obtained an accurate representation 
of the analytical information, the data can be manipulated further to aid the 
spectroscopist. One of the most common data processing procedures is digital 
filtering or smoothing to enhance the signal-to-noise ratio. Before discussing 
filtering, however, it will be worthwhile considering the concept of the signal- 
to-noise ratio and its statistical basis. 


3 Signal-to-Noise Ratio 


The spectral information used in an analysis is encoded as an electrical signal 
from the spectrometer. In addition to desirable analytical information, such 
signals contain an undesirable component termed noise which can interfere
with the accurate extraction and interpretation of the required analytical data. 

There are numerous sources of noise that arise from instrumentation, but 
briefly the noise will comprise flicker noise, interference noise, and white noise. 
These classes of noise signals are characterized by their frequency distribution. 
Flicker noise is characterized by a frequency power spectrum that is more 
pronounced at low frequencies than at high frequencies. This is minimized in 
instrumentation by modulating the carrier signal and using a.c. detection and 
a.c. signal processing, e.g. lock-in amplifiers. Interference from power supplies 
may also add noise to the signal. Such noise is usually confined to specific 
frequencies about 50 Hz, or 60 Hz, and their harmonics. By employing modu- 
lation frequencies well away from the power line frequency, interference noise 
can be reduced, and minimized further by using highly selective, narrow- 
bandpass electronic filters. White noise is more difficult to eliminate since it is 
random in nature, occurring at all frequencies in the spectrum. It is a funda- 
mental characteristic of all electronic instruments. In recording a spectrum, 
complete freedom from noise is an ideal that can never be realized in practice. 
The noise associated with a recorded signal has a profound effect in an analysis 
and one figure of merit used to describe the quality of a measurement is the 
signal-to-noise ratio, S/N, which is defined as, 


$$S/N = \frac{\text{average signal magnitude}}{\text{rms noise}} \qquad (2)$$


The rms (root mean square) noise is the square root of the average squared
deviation of the signal, $x_i$, from the mean value, $\bar{x}$, over N measurements, i.e.


$$\text{rms noise} = \sqrt{\frac{\sum_{i} (x_i - \bar{x})^2}{N - 1}} \qquad (3)$$


This equation should be recognized as equating rms noise with the standard
deviation of the noise signal, σ. S/N can, therefore, be defined as $\bar{x}/\sigma$.

In spectrometric analysis S/N is usually measured in one of two ways. The 
first technique is repeatedly to sample and measure the analytical signal and 
determine the mean and standard deviation using Equation (3). Where a chart 
recorder output is available, then a second method may be used. Assuming the 
noise is random and normally distributed about the mean, it is likely that 99% 
of the random deviations in the recorded signal will lie within ±2.5σ of the
mean value. By measuring the peak-to-peak deviation of the signal and dividing 
by 5, an estimate of the rms noise is obtained. The use of this method is 
illustrated in Figure 4. Whichever method is used, the signal should be sampled 
for sufficient time to allow a reliable estimate of the standard deviation to be 
made. When measuring S/N it is usually assumed that the noise is independent 
of signal magnitude for small signals close to the baseline or background signal. 
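
Both ways of estimating S/N can be sketched in a few lines. The code below is an illustration under stated assumptions rather than a prescribed procedure: the function names are hypothetical, the standard deviation is taken with N − 1 in the denominator, and the trace is simulated.

```python
import numpy as np

def snr_from_replicates(x):
    """S/N as the mean divided by the standard deviation of repeated readings."""
    x = np.asarray(x, dtype=float)
    return x.mean() / x.std(ddof=1)   # ddof=1 divides by N - 1 (an assumption)

def snr_from_peak_to_peak(x):
    """S/N with the rms noise estimated as (peak-to-peak deviation) / 5."""
    x = np.asarray(x, dtype=float)
    rms_noise = (x.max() - x.min()) / 5.0
    return x.mean() / rms_noise

rng = np.random.default_rng(0)
trace = 10.0 + rng.normal(0.0, 0.5, size=120)   # simulated trace: S = 10, sigma = 0.5
print(snr_from_replicates(trace), snr_from_peak_to_peak(trace))
```

The two estimates should be of similar size for normally distributed noise, although the peak-to-peak value depends on only the two most extreme readings and so is the less reliable of the two.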

Noise, as well as affecting the appearance of a spectrum, influences the 
sensitivity of an analytical technique and for quantitative analysis the S/N ratio 

Figure 4 Amplified trace of an analytical signal, plotted against time, recorded with
amplitude close to the background level, showing the mean signal amplitude, S, and the
standard deviation, s. The peak-to-peak noise is 5s


is of fundamental importance. Analytical terms dependent on the noise con- 
tained in the signal are the decision limit, the detection limit, and the determi- 
nation limit. These analytical figures of merit are often quoted by instrument 
manufacturers and a knowledge of their calculation is important in evaluating 
and comparing instrument performance in terms of analytical sensitivity. 


4 Detection Limits 


The concept of an analytical detection limit implies that we can make a 
qualitative decision regarding the presence or absence of analyte in a sample. In 
arriving at such a decision there are two basic types of error that can arise 
(Table 1). The Type I error leads to the conclusion that the analyte is present in 
a sample when it is known not to be, and the Type II error is made if we 
conclude that the analyte is absent, when in fact it is present. The definition of a 
detection limit should address both types of error.³


Table 1 The Type I and Type II errors that can be made in accepting or rejecting
a statistical hypothesis

                          HYPOTHESIS IS CORRECT    HYPOTHESIS IS INCORRECT
HYPOTHESIS IS ACCEPTED    Correct decision         Type II Error
HYPOTHESIS IS REJECTED    Type I Error             Correct decision


3 J.C. Miller and J.N. Miller, ‘Statistics for Analytical Chemistry’, Ellis Horwood, Chichester, UK, 
1993. 
Figure 5 (a) The normal distribution with the 5% critical region highlighted. Two
normally distributed signals with equal variances overlapping, with the mean of
one located at the 5% point of the other (b), the decision limit; overlapping at
their 5% points with means separated by 3.3σ (c), the detection limit; and their
means separated by 10σ (d), the determination limit


Consider an analytical signal produced by a suitable blank sample, with a
mean value of $\mu_b$ and standard deviation $\sigma_b$. If we assume that noise in this
background measurement is random and normally distributed about $\mu_b$, then 95%
of this noise will lie below $\mu_b + 1.65\sigma_b$ (Figure 5). With a 5% chance of
committing a Type I error, then any analysis giving a response value greater than
$\mu_b + 1.65\sigma_b$ can be assumed to indicate the presence of the analyte. This measure
is referred to as the decision limit,


$$\text{Decision Limit} = z_{0.95}\,\sigma_b = 1.65\,\sigma_b \qquad (4)$$


If the number of measurements made to calculate $\sigma_b$ is small, then the
appropriate value from the t-distribution should be used in place of the z-value
as obtained from the normal distribution curve.

We may ask, what if a sample containing analyte at a concentration equiv- 
alent to the decision limit is repeatedly analysed? In such a case, we can expect 
that in 50% of the measurements the analyte will be reported present, but in the 
other half of the measurements the analyte will be reported as not present. This
attempt at defining a detection limit using the decision limit defined by Equa- 
tion (4) does not address the occurrence of the Type II error. 

If, as with the Type I error, we are willing to accept a 5% chance of
committing a Type II error, then the relationship between the blank signal and
sample measurement is as indicated in Figure 5(c). This defines the detection
limit,


$$\text{Detection Limit} = 2 z_{0.95}\,\sigma_b = 3.3\,\sigma_b \qquad (5)$$


Under these conditions, we have a 5% chance of reporting the analyte 
present in a blank solution, and a 5% chance of reporting the analyte absent in 
a sample actually containing analyte at the concentration defined by the 
detection limit. 

We should examine the precision of measurements made at this limit before 
accepting this definition of detection limit. The repeated measurement of the 
instrumental response from a sample containing analyte at the detection limit 
will lead to the analyte being reported as below the detection limit for 50% of 
the analyses. The relative standard deviation, RSD, of such measurements is 
given by 


$$\text{RSD} = 100\,\sigma/\mu = 100/(2 z_{0.95}) = 30.3\% \qquad (6)$$


This hardly constitutes suitable precision for quantitative analysis, which
should have an RSD of 10% or less. For an RSD of 10%, a further term can be
defined called the determination limit, Figure 5(d),


$$\text{Determination Limit} = 10\,\sigma_b \qquad (7)$$


When comparing methods, therefore, the defining equations should be 
identified and the definitions used should be agreed. 

As we can see, the limits of quantitative analysis are influenced by the noise in 
the system and to improve the detection limit it is necessary to enhance the 
signal-to-noise ratio. 
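
The three limits follow directly from replicate measurements of a blank. The sketch below is illustrative only: the function name and the dictionary keys are hypothetical, the limits are expressed as signal increments above the blank mean, and the z-value is taken from the normal distribution (for few replicates the t-distribution would be substituted, as noted above).

```python
from statistics import NormalDist, mean, stdev

def analytical_limits(blank_readings, alpha=0.05):
    """Decision, detection and determination limits from replicate blank readings."""
    mu_b = mean(blank_readings)
    sigma_b = stdev(blank_readings)            # sample standard deviation of the blank
    z = NormalDist().inv_cdf(1.0 - alpha)      # one-sided z-value, about 1.645 for 5%
    return {
        "blank_mean": mu_b,
        "decision": z * sigma_b,               # Equation (4)
        "detection": 2.0 * z * sigma_b,        # Equation (5)
        "determination": 10.0 * sigma_b,       # Equation (7)
    }

limits = analytical_limits([0.021, 0.019, 0.023, 0.020, 0.018, 0.022, 0.020])
for name, value in limits.items():
    print(name, round(value, 4))
```

A sample would be reported as containing analyte when its response exceeds `blank_mean + decision`.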


5 Reducing Noise 


If we assume that the analytical conditions have been optimized, say to produce 
maximum signal intensity, then any increase in signal-to-noise ratio will be 
achieved by reducing the noise level. Various strategies are widely employed to 
reduce noise, including signal averaging, smoothing, and filtering. It is common 
in modern spectrometers for several methods to be used on the same analytical 
data at different stages in the data processing scheme (Figure 1). 


Signal Averaging 


The process of signal averaging is conducted by repetitively scanning and 
co-adding individual spectra. Assuming the noise is randomly distributed, then 




the analytical signals which are coherent in time are enhanced, since the signal 
grows linearly with the number of scans, N, 


$$\text{signal magnitude} \propto N, \qquad \text{signal magnitude} = k_1 N \qquad (8)$$


To consider the effect of signal averaging on the noise level we must refer to 
the propagation of errors. The variance associated with the sum of independent 
errors is equal to the sum of their variances, i.e. 


$$\sigma_N^2 = \sum_{i=1}^{N} \sigma^2 = N\sigma^2 \qquad (9)$$


Since we can equate rms noise with standard deviation then,

$$\sigma_N = \sqrt{N\sigma^2} \qquad (10)$$


Thus the average magnitude of random noise increases at a rate proportional 
to the square root of the number of scans, 


$$\text{noise magnitude} \propto N^{1/2}, \qquad \text{noise magnitude} = k_2 N^{1/2} \qquad (11)$$


Therefore, 


$$\frac{\text{signal}}{\text{noise}} = \frac{k_1 N}{k_2 N^{1/2}} = k N^{1/2} \qquad (12)$$


and the signal-to-noise ratio is improved at a rate proportional to the square 
root of the number of scans. Figure 6 illustrates part of an infrared spectrum 
and the effect of signal averaging 4, 9, and 16 spectra. The increase in signal-to- 
noise ratio associated with increasing the number of co-added repetitive scans is 
evident. 
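
The square-root relationship can be demonstrated with simulated data. This is a minimal sketch under simple assumptions (a constant signal with normally distributed, independent noise on every scan); the function name and parameter values are illustrative and not from the original text.

```python
import numpy as np

def snr_after_coadding(n_scans, rng, n_points=5000, signal=1.0, sigma=0.5):
    """Simulate n_scans noisy scans of a constant signal, co-add them, return S/N."""
    scans = signal + rng.normal(0.0, sigma, size=(n_scans, n_points))
    total = scans.sum(axis=0)                 # co-added spectrum
    return total.mean() / total.std(ddof=1)   # mean signal over rms noise

rng = np.random.default_rng(1)
single = snr_after_coadding(1, rng)
for n_scans in (4, 9, 16):
    gain = snr_after_coadding(n_scans, rng) / single
    print(n_scans, round(gain, 2), "theory:", np.sqrt(n_scans))
```

The measured gains should lie close to 2, 3, and 4, the square roots of the number of scans co-added.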

For signal averaging to be effective, each scan must start at the same place 
in the spectrum otherwise analytical signals and useful information will also 
cancel and be removed. The technique is widely used but is most common in 
fast scanning spectrometers, particularly Fourier transform instruments such 
as NMR and IR. Co-adding one hundred scans is common in infrared 
spectroscopy in order to achieve a theoretical enhancement of 10:1 in signal- 
to-noise ratio. Whilst further gains can be achieved, practical considerations 
may limit the process. Even with a fast scan, say 1s, the time required to 
perform 10000 scans and aim to achieve a 100-fold improvement in signal-to- 
noise ratio may be unacceptable. In addition, computer memory constraints 
on storing the accumulated spectra may limit the maximum number of scans 
permitted. 

Figure 6 An infrared spectrum and the results of co-adding 4, 9, and 16 scans from the
same region





Signal Smoothing 


A wide variety of mathematical manipulation schemes are available to smooth 
spectral data, and in this section we shall concentrate on smoothing techniques 
that serve to average a section of the data. They are all simple to implement on 
personal computers. This ease of use has led to their widespread application, 
but their selection and tuning is somewhat empirical and depends on the 
application in-hand. 

One simple smoothing procedure is boxcar averaging. Boxcar averaging 
proceeds by dividing the spectral data into a series of discrete, equally spaced, 
bands and replacing each band by a centroid average value. Figure 7 illustrates 
the results using the technique for different widths of the filter window or band. 
The greater the number of points averaged, the greater the degree of smoothing, 
but there is also a corresponding increase in distortion of the signal and 
subsequent loss of spectral resolution. The technique is derived from the use of 
electronic boxcar integrator units. It is less widely used in modern spectrometry 
than the methods of moving average and polynomial smoothing. 

As with boxcar averaging, the moving average method replaces a group of 
values by their mean value. The difference in the techniques is that with the 
moving average successive bands overlap. Consider the spectrum illustrated in 
Figure 8, which is composed of transmission values, denoted $x_i$. By averaging
the first five values, i = 1 ... 5, a mean transmission value is produced which
provides the value for the third data point, $x'_3$, in the smoothed spectrum. The
procedure continues by incrementing i and averaging the next five values to find
$x'_4$ from original data $x_2$, $x_3$, $x_4$, $x_5$, and $x_6$. The degree of smoothing achieved is
controlled by the number of points averaged, i.e. the width of the smoothing 
window. Distortion of the data is usually less apparent with the moving average 
method than with boxcar averaging. 

Figure 7 An infrared spectrum and the results of applying a 5-point boxcar average, a
7-point average, and a 9-point average

Figure 8 Smoothing with a 5-point moving average. Each new point in the smoothed
spectrum is formed by averaging a span of 5 points from the original data

Figure 9 Convolution of a spectrum with a filter is achieved by pulling the filter function
across the spectrum


The mathematical process of implementing the moving average technique is
termed convolution. The resultant spectrum, x' (as a vector), is said to be the
result of convolution of the original spectrum vector, x, with a filter function, w,
i.e.


$$\mathbf{x}' = \mathbf{w} * \mathbf{x} \qquad (13)$$


For the simple five-point moving average, w = [1,1,1,1,1]. The mechanism
and application of the convolution process can be visualized graphically as 
illustrated in Figure 9. 
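
The difference between boxcar and moving average smoothing is easily seen in code. The sketch below is illustrative, with hypothetical function names; the moving average is written as a convolution with a window of ones, divided by the sum of the weights. End points for which a full window is not available are simply dropped, which is one of several possible conventions.

```python
import numpy as np

def moving_average(x, width=5):
    """Overlapping windows: convolve with w = [1, 1, ..., 1] and normalise."""
    w = np.ones(width)
    return np.convolve(x, w, mode="valid") / w.sum()

def boxcar_average(x, width=5):
    """Non-overlapping bands: each band of `width` points is replaced by its mean."""
    x = np.asarray(x, dtype=float)
    n_bands = len(x) // width            # any incomplete final band is discarded
    return x[: n_bands * width].reshape(n_bands, width).mean(axis=1)

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0, 18.0, 20.0])
print(moving_average(x))   # six overlapping 5-point means
print(boxcar_average(x))   # two non-overlapping 5-point means
```

Ten input points give six smoothed values from the moving average (the first corresponding to the third original point) but only two from the boxcar, which illustrates the loss of resolution associated with the latter.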

In 1964 Savitzky and Golay described a technique for smoothing spectral 
data using convolution filter vectors derived from the coefficients of least-squares-fit
polynomial functions.⁴ This paper, with subsequent arithmetic corrections,⁵
has become a classic in analytical signal processing and least-squares
polynomial smoothing is probably the technique in widest use in spectral data 
processing and manipulation. To appreciate its derivation and application we 
should extend our discussion of the moving average filter. 

The simple moving average technique can be represented mathematically by 


$$x'_i = \sum_{j=-n}^{n} x_{i+j}\,w_j \Big/ \sum_{j=-n}^{n} w_j \qquad (14)$$


where $x_i$ and $x'_i$ are elements of the original and smoothed data vectors
respectively, and the values $w_j$ are the weighting factors in the smoothing
window. For a simple moving average function, $w_j = 1$ for all j and the width of
the smoothing function is defined by (2n + 1) points.


4 A. Savitzky and M.J.E. Golay, Anal. Chem., 1964, 36, 1627.
5 J. Steinier, Y. Termonia, and J. Deltour, Anal. Chem., 1972, 44, 1906.
The process of polynomial smoothing extends the principle of the moving
average by modifying the weight vector, w, such that the elements of w describe
a convex polynomial. The central value in each window, therefore, adds more
to the averaging process than values at the extremes of the window.

Consider five data points forming a part of a spectrum described by the data
set x recorded at equal wavelength intervals. Polynomial smoothing seeks to
replace the value of the point $x_j$ by a value calculated from the least-squares
polynomial fitted to $x_{j-2}$, $x_{j-1}$, $x_j$, $x_{j+1}$, and $x_{j+2}$ recorded at wavelengths
denoted by $\lambda_{j-2}$, $\lambda_{j-1}$, $\lambda_j$, $\lambda_{j+1}$, and $\lambda_{j+2}$.

For a quadratic curve fitted to the data, the model can be expressed as 


$$x' = a_0 + a_1\lambda + a_2\lambda^2 \qquad (15)$$


where x' is the fitted model data and $a_0$, $a_1$, and $a_2$ are the coefficients or weights
to be determined.

Using the method of least squares, the aim is to minimize the error, ε, given
by the square of the difference between the model function, Equation (15), and
the observed data, summed over all data values fitted, i.e.


$$\varepsilon = \sum_{j=-n}^{n} (x'_j - x_j)^2 = \sum_{j=-n}^{n} \left(a_0 + a_1\lambda_j + a_2\lambda_j^2 - x_j\right)^2 \qquad (16)$$


and, by simple differential calculus, this error function is a minimum when its 
first derivative is zero. 

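Differentiating Equation (16) with respect to $a_0$, $a_1$, and $a_2$ respectively,
provides a set of so-called normal equations, in which all sums are taken over
j = −n ... n,


$$\begin{aligned}
a_0 \sum 1 + a_1 \sum \lambda_j + a_2 \sum \lambda_j^2 &= \sum x_j \\
a_0 \sum \lambda_j + a_1 \sum \lambda_j^2 + a_2 \sum \lambda_j^3 &= \sum \lambda_j x_j \\
a_0 \sum \lambda_j^2 + a_1 \sum \lambda_j^3 + a_2 \sum \lambda_j^4 &= \sum \lambda_j^2 x_j
\end{aligned} \qquad (17)$$


Because the $\lambda_j$ values are equally spaced, $\Delta\lambda = \lambda_j - \lambda_{j-1}$ is constant and only
relative λ values are required for the model,


$$\lambda_j = j\,\Delta\lambda \qquad (18)$$

Hence, for j = −2 ... +2 (a five-point fit),

$$\begin{aligned}
\sum \lambda_j &= \Delta\lambda \sum j = 0 \\
\sum \lambda_j^2 &= \Delta\lambda^2 \sum j^2 = 10\,\Delta\lambda^2 \\
\sum \lambda_j^3 &= \Delta\lambda^3 \sum j^3 = 0 \\
\sum \lambda_j^4 &= \Delta\lambda^4 \sum j^4 = 34\,\Delta\lambda^4
\end{aligned} \qquad (19)$$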
which can be substituted into the normal equations, Equations (17), giving


$$\begin{aligned}
5a_0 + 10\,\Delta\lambda^2 a_2 &= \sum x_j = x_{j-2} + x_{j-1} + x_j + x_{j+1} + x_{j+2} \\
10\,\Delta\lambda\,a_1 &= \sum j\,x_j = -2x_{j-2} - x_{j-1} + x_{j+1} + 2x_{j+2} \\
10a_0 + 34\,\Delta\lambda^2 a_2 &= \sum j^2 x_j = 4x_{j-2} + x_{j-1} + x_{j+1} + 4x_{j+2}
\end{aligned} \qquad (20)$$


which can be rearranged,


$$\begin{aligned}
a_0 &= (-3x_{j-2} + 12x_{j-1} + 17x_j + 12x_{j+1} - 3x_{j+2})/35 \\
a_1 &= (-2x_{j-2} - x_{j-1} + x_{j+1} + 2x_{j+2})/(10\,\Delta\lambda) \\
a_2 &= (2x_{j-2} - x_{j-1} - 2x_j - x_{j+1} + 2x_{j+2})/(14\,\Delta\lambda^2)
\end{aligned} \qquad (21)$$


At the central point in the smoothing window, $\lambda_j = 0$ and $x'_j = a_0$ from
Equation (15). The five weighting coefficients, $w_j$, are given by the first equation
in Equation (21),


$$\mathbf{w} = [-3, 12, 17, 12, -3] \qquad (22)$$


Savitzky and Golay published the coefficients for a range of least-squares fit
curves with up to 25-point wide smoothing windows for each.⁴ Corrections to
the original tables have been published by Steinier et al.⁵

Table 2 presents the weighting coefficients for performing 5, 9, 13, and 
17-point quadratic smoothing and the results of applying these functions to the 
infrared spectral data are illustrated in Figure 10. 
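
The weighting coefficients need not be looked up; they can be generated by carrying out the least-squares fit numerically. The following is a minimal sketch, with a hypothetical function name, that fits a polynomial to unit-spaced relative positions and extracts the weights that give the fitted value at the window centre. Multiplying by the normalising factor recovers the integer coefficients.

```python
import numpy as np

def savgol_smoothing_weights(n_points, degree=2):
    """Weights that return the least-squares polynomial value at the window centre."""
    half = n_points // 2
    j = np.arange(-half, half + 1)
    # Design matrix with columns 1, j, j**2, ... (relative positions, unit spacing).
    A = np.vander(j, degree + 1, increasing=True).astype(float)
    # Row 0 of the pseudo-inverse maps the data onto a0, the fitted centre value.
    return np.linalg.pinv(A)[0]

print(np.round(savgol_smoothing_weights(5) * 35))    # five-point weights times 35
print(np.round(savgol_smoothing_weights(9) * 231))   # nine-point weights times 231
```

The five-point result reproduces the coefficients −3, 12, 17, 12, −3 derived above, and the same function gives the wider windows by changing `n_points`.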

When choosing to perform a Savitzky-Golay smoothing operation on spec- 
tral data, it is necessary to select the filtering function (quadratic, quartic, etc.), 
the width of the smoothing function (the number of points in the smoothing 
window), and the number of times the filter is to be applied successively to the 



Figure 10 Savitzky-Golay quadratic smoothing of the spectrum from Figure 7(a) using a
5-point span (a), a 9-point span (b), a 13-point span (c), and a 17-point span (d);
wavenumber axis from 3000 to 2800 cm⁻¹

Table 2 Savitzky-Golay coefficients, or weightings, for 5-, 9-, 13-, and 17-point
quadratic smoothing of continuous spectral data

Points     17     13      9      5
  -8      -21
  -7       -6
  -6        7    -11
  -5       18      0
  -4       27      9    -21
  -3       34     16     14
  -2       39     21     39     -3
  -1       42     24     54     12
   0       43     25     59     17
   1       42     24     54     12
   2       39     21     39     -3
   3       34     16     14
   4       27      9    -21
   5       18      0
   6        7    -11
   7       -6
   8      -21
Norm      323    143    231     35


data. Although the final choice is largely empirical, the quadratic function is the 
most commonly used, with the window width selected according to the scan- 
ning conditions. A review and account of selecting a suitable procedure has 
been presented by Enke and Nieman.⁶


Filtering in the Frequency Domain 


The smoothing operations discussed above have been presented in terms of the 
action of filters directly on the spectral data as recorded in the time domain. By 
converting the analytical spectrum to the frequency domain, the performance 
of these functions can be compared and a wide variety of other filters designed. 
Time-to-frequency conversion is accomplished using the Fourier transform. Its 
use was introduced earlier in this chapter in relation to sampling theory, and its 
application will be extended here. 

The electrical output signal from a conventional scanning spectrometer 
usually takes the form of an amplitude-time response, e.g. absorbance vs. 
wavelength. All such signals, no matter how complex, may be represented as a 
sum of sine and cosine waves. The continuous function of composite frequencies
is called a Fourier integral. The conversion of amplitude-time, t, information
into amplitude-frequency, ω, information is known as a Fourier transformation.
The relation between the two forms is given by


6 C.G. Enke and T.A. Nieman, Anal. Chem., 1976, 48, 705.
$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,[\cos(2\pi\omega t) - i\sin(2\pi\omega t)]\,dt \qquad (23)$$


or, in complex exponential form,


$$F(\omega) = \int_{-\infty}^{\infty} f(t)\,e^{-2\pi i\omega t}\,dt \qquad (24)$$


The corresponding reverse, or inverse, transform, converting the complex
frequency domain information back to the time domain is


$$f(t) = \int_{-\infty}^{\infty} F(\omega)\,e^{2\pi i\omega t}\,d\omega \qquad (25)$$


The two functions f(t) and F(ω) are said to comprise Fourier transform pairs.

As discussed previously with regard to sampling theory, real analytical
signals are band-limited. The Fourier equations therefore should be modified
for practical use as we cannot sample an infinite number of data points. With
this practical constraint, the discrete forward complex transform is given by


$$F(n) = \sum_{k=0}^{N-1} f(k)\,e^{-2\pi i k n/N} \qquad (26)$$

and the inverse is

$$f(k) = \frac{1}{N} \sum_{n=0}^{N-1} F(n)\,e^{2\pi i k n/N} \qquad (27)$$


A time domain spectrum consists of N points acquired at regular intervals
and it is transformed to a frequency domain spectrum. This consists of N/2 real
and N/2 imaginary data points, with n = −N/2 ... 0 ... N/2, and k takes
integer values from 0 to N − 1.

Once a frequency spectrum of a signal is computed then it can be modified 
mathematically to enhance the data in some well defined manner. The suitably 
processed spectrum can then be obtained by the inverse transform. 
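
The discrete transform pair can be written out directly as the two sums. The sketch below assumes the standard form of the discrete Fourier transform, with a negative exponent in the forward direction and the factor 1/N in the inverse; the function names are hypothetical. In practice a fast Fourier transform routine, such as NumPy's `np.fft.fft`, gives the same result far more efficiently.

```python
import numpy as np

def dft(f):
    """Forward discrete transform: F(n) = sum over k of f(k) exp(-2 pi i k n / N)."""
    f = np.asarray(f, dtype=complex)
    N = len(f)
    k = np.arange(N)
    n = k.reshape(-1, 1)                      # one row of the sum per output index n
    return (f * np.exp(-2j * np.pi * k * n / N)).sum(axis=1)

def inverse_dft(F):
    """Inverse transform: f(k) = (1/N) sum over n of F(n) exp(+2 pi i k n / N)."""
    F = np.asarray(F, dtype=complex)
    N = len(F)
    n = np.arange(N)
    k = n.reshape(-1, 1)                      # one row of the sum per output index k
    return (F * np.exp(2j * np.pi * k * n / N)).sum(axis=1) / N

signal = np.array([0.0, 1.0, 4.0, 9.0, 4.0, 1.0, 0.0, 0.0])
spectrum = dft(signal)
print(np.allclose(spectrum, np.fft.fft(signal)))        # agrees with the FFT routine
print(np.allclose(inverse_dft(spectrum).real, signal))  # round trip recovers the data
```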

Several Fourier transform pairs are shown pictorially in Figure 11. An
infinitely sharp amplitude-time signal, Figure 11(a), has a frequency response 
spectrum containing equal amplitudes at all frequencies. This is the white 
spectrum characteristic of a random noise amplitude-time signal. As the signal 
becomes broader, the frequency spectrum gets narrower. The higher frequen- 
cies are reduced dramatically and the frequency spectrum has the form 
(sin x)/x, called the sinc function, Figure 11(b). For a triangular signal, Figure 
11(c), the functional form of the frequency spectrum is (sin² x)/x², the sinc²
function. The sinc and sinc² forms are common filtering functions in interferometry,
where their application is termed apodisation. The frequency
response spectra of Lorentzian and Gaussian shaped signals are of particular 
Figure 11 Some well characterized Fourier pairs. The white spectrum and the impulse
function (a), the boxcar and sinc functions (b), the triangular and sinc² functions (c),
and the Gaussian pair (d)
(Reproduced by permission from ref. 7) 


interest since these shapes describe typical spectral profiles. The Fourier trans- 
form of a Gaussian signal is another Gaussian form, and for a Lorentzian signal 
the transform takes the shape of an exponentially decaying oscillator. 

One of the earliest applications of the Fourier transform in spectroscopy was 
in filtering and noise reduction. This technique is still extensively employed. 

Figure 12 presents the Fourier transform of an infrared spectrum, before and 
after applying the 13-point quadratic Savitzky-Golay function. The effect of 
smoothing can clearly be seen as reducing the high-frequency fluctuations, 
hopefully due to noise, by the polynomial function serving as a low-pass filter. 
Convolution provides an important technique for smoothing and processing 
spectral data, and can be undertaken in the frequency domain by simple 
multiplication. Thus smoothing can be accomplished in the frequency domain, 
following Fourier transformation of the data, by multiplying the Fourier 
transform by a rectangular or other truncating function. The low-frequency 


7 R. Bracewell, ‘The Fourier Transform and Its Applications’, McGraw-Hill, New York, USA,
1965. 

Figure 12 A spectrum (a) and its Fourier transform before (b) and after applying a
13-point quadratic smoothing filter (c)


Fourier coefficients should be relatively unaffected, whereas the high-frequency 
components characterizing random noise are reduced or zeroed. The sub- 
sequent inverse transform then yields the smoothed waveform. 

The rectangular window function is a simple truncating function which can 
be applied to transformed data. This function has zero values above some 
pre-selected cut-off frequency, fc, and unit values at lower frequencies. Using
various cut-off frequencies for the truncating function and applying the inverse 
transform results in the smoothed spectra shown in Figure 13. 
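
Smoothing with a rectangular truncating function amounts to zeroing the Fourier coefficients above the cut-off and transforming back. The sketch below is an illustration under stated assumptions: the function name is hypothetical, the cut-off is expressed as the number of low-frequency coefficients retained, and NumPy's real-input transform routines are used so that the real and imaginary parts are handled together.

```python
import numpy as np

def fourier_lowpass(x, n_keep):
    """Keep the first n_keep Fourier coefficients, zero the rest, and invert."""
    X = np.fft.rfft(x)            # complex coefficients for frequencies 0 .. f_max
    X[n_keep:] = 0.0              # rectangular window: unit below cut-off, zero above
    return np.fft.irfft(X, n=len(x))

t = np.arange(256) / 256.0
slow = np.sin(2 * np.pi * 2 * t)             # low-frequency 'signal' component
fast = 0.3 * np.sin(2 * np.pi * 60 * t)      # high-frequency 'noise' component
smoothed = fourier_lowpass(slow + fast, n_keep=6)
print(np.allclose(smoothed, slow))           # the high-frequency component is removed
```

Retaining too few coefficients also removes genuine spectral detail, which is why the choice of cut-off matters.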

Although the selection of an appropriate cut-off frequency value is somewhat 
arbitrary, various methods of calculating a suitable value have been proposed 
in the literature. The method of Lam and Isenhour⁸ is worth mentioning, not
least because of the relative simplicity in calculating fc. The process relies on
determining what is termed the equivalent width, EW, of the narrowest peak in 
the spectrum. For a Lorentzian band the equivalent width in the time domain is 
given by 


$$EW_t = \sigma_{1/2}\,\pi/2 \qquad (28)$$

where $\sigma_{1/2}$ is the full-width at half-maximum, the half-width, of the narrowest
peak in the spectrum.

The equivalent width in the frequency domain, $EW_f$, is simply the reciprocal


8 R.B. Lam and T.L. Isenhour, Anal. Chem., 1981, 53, 1179. 
Figure 13 A spectrum and its Fourier transform (a). The transform and its inverse
retaining (b) 40, (c) 20, and (d) 6 of the Fourier coefficients


of $EW_t$. Assuming the spectrum was acquired in a single scan taking 10 s and
it comprises 256 discrete points, then the sampling interval, Δt, is given by


$$\Delta t = 10/256 = 0.039\ \text{s} \qquad (29)$$

and the maximum frequency, $f_{max}$, by

$$f_{max} = 1/(2\Delta t) = 12.75\ \text{Hz} \qquad (30)$$


The IR spectrum was synthesized from two Lorentzian bands, the sharpest
having $\sigma_{1/2}$ = 1.17 s. Therefore $EW_t$ = 1.838 s and $EW_f$ = 0.554 Hz.

The complex interferogram of 256 points is composed of 128 real values and 
128 imaginary values spanning the range 0-12.75 Hz. According to the EW 
criterion, a suitable cut-off frequency is 0.554 Hz and the number of significant 
points, N, to be retained may be calculated from 


$$N = (128)(0.554)/12.75 \approx 6 \qquad (31)$$

Thus, points 7 to 128 are zeroed in both the real and imaginary arrays before
performing the inverse transform, Figure 13(d). Obviously, to use the tech- 
nique, it is necessary to estimate the half-width of the narrowest band present. 
Where possible this is usually done using some sharp isolated band in the 
spectrum. 
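
The equivalent-width calculation can be collected into a single function. This is a sketch only: the function name is hypothetical, and rounding the result up to the next whole point is an assumption made here so that no coefficient below the cut-off is discarded. Because the code carries full precision through each step, intermediate values may differ slightly from the rounded figures quoted in the text.

```python
import math

def lam_isenhour_points(half_width, scan_time, n_points):
    """Number of low-frequency Fourier points to retain, from the EW criterion."""
    ew_t = half_width * math.pi / 2.0     # equivalent width of a Lorentzian band
    ew_f = 1.0 / ew_t                     # cut-off frequency, Hz
    dt = scan_time / n_points             # sampling interval, s
    f_max = 1.0 / (2.0 * dt)              # maximum frequency, Hz
    n_half = n_points // 2                # number of real (or imaginary) values
    return math.ceil(n_half * ew_f / f_max)   # rounded up (an assumption)

print(lam_isenhour_points(half_width=1.17, scan_time=10.0, n_points=256))
```

For the synthesized spectrum described above, the function indicates that the first six points should be retained.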

All the smoothing functions discussed in previous sections can be displayed
and compared in the frequency domain, and in addition new filters can be
designed. Bromba and Ziegler have made an extensive study of such ‘designer’
filters.⁹,¹⁰ The Savitzky-Golay filter acts as a low-pass filter that is optimal for
polynomial shaped signals. Of course, in spectrometry Gaussian or Lorentzian 
band shapes are the usual form and the polynomial is only an approximation to 
a section of the spectrum defined by the width of the filter window. There is no 
reason why filters other than the polynomial should not be employed for 
smoothing spectral data. Use of the Savitzky-Golay procedure is as much 
traditional as representing any theoretical optimum. 

Bromba and Ziegler have defined a general filter with weighting elements 
defined by the form 


$$w_j = \frac{2\alpha + 1}{2n + 1} - \frac{\alpha\,|j|}{n(n + 1)} \qquad (32)$$

where w is the vector of coefficients, j = −n ... n, and α is a shape parameter.


The frequency-response curves of three 15-point filters with α = 0.5, 1, and 2
are illustrated in Figure 14. The case of the filter constructed with α = 2 is of
and then falls off rapidly. The effect of convoluting a spectrum with this 
function is apparently to enhance resolution. The practical use of such filters 
should be undertaken with care, however, and they are best used in an 
interactive mode when the user can visibly assess the effects before proceeding 
to further data manipulation. 

Whatever smoothing technique is employed, the aim is to reduce the effects 
of random variations superimposed on the analytically useful signal. This 
transform can be simply expressed as 


Spectrum (smoothed) = Spectrum (raw) — noise (33) 


Assuming all noise is removed then the result is the true spectrum. Conver- 
sely, from Equation (33), if the smoothed spectrum is subtracted from the 
original, raw data, then a noise spectrum is obtained. The distribution of this 
noise as a function of wavelength may provide information regarding the 
source of the noise in spectrometers. The procedure is analogous to the analysis 
of residuals in regression analysis and modelling. 
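
Obtaining the noise spectrum by subtraction is straightforward. The sketch below is illustrative: the function name is hypothetical and a 5-point moving average is used as the smoother purely for simplicity; any of the filters discussed in this section could be substituted.

```python
import numpy as np

def noise_spectrum(raw, width=5):
    """Estimate the noise as raw minus smoothed data, using a moving average."""
    raw = np.asarray(raw, dtype=float)
    smoothed = np.convolve(raw, np.ones(width) / width, mode="same")
    half = width // 2
    # Discard the window half-width at each end, where the average is incomplete.
    return (raw - smoothed)[half:-half]

rng = np.random.default_rng(2)
baseline = np.linspace(1.0, 2.0, 200)                 # slowly varying 'true' signal
raw = baseline + rng.normal(0.0, 0.05, size=200)      # simulated noisy record
residual = noise_spectrum(raw)
print(residual.mean(), residual.std(ddof=1))
```

A residual with a mean close to zero and no obvious structure is consistent with random noise, whereas systematic patterns in the residual suggest that the smoothing has distorted the analytical signal.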


9 M.U.A. Bromba and H. Ziegler, Anal. Chem., 1983, 55, 1299.
10 M.U.A. Bromba and H. Ziegler, Anal. Chem., 1983, 55, 648. 
Figure 14 The frequency response of filters of Bromba and Ziegler for α values of 0.5, 1.0,
and 2.0


6 Interpolation 


Not all analytical data can be recorded on a continuous basis; discrete measure- 
ments often have to be made and they may not be at regular time or space 
intervals. To predict intermediate values for a smooth graphic display, or to 
perform many mathematical manipulations, e.g. Savitzky-Golay smoothing, it 
is necessary to evaluate regularly spaced intermediate values. Such values are 
obtained by interpolation. 

Obviously, if the true underlying mathematical relationship between the 
independent and dependent variables is known then any value can be computed 
exactly. Unfortunately, this information is rarely available and any required 
interpolated data must be estimated. 

The data in Table 3, shown in Figure 15, consist of magnesium concentra- 
tions as determined from river water samples collected at various distances 
from the stream mouth. Because of the problems of accessibility to sampling 
sites, the samples were collected at irregular intervals along the stream channel 
and the distances between samples were calculated from aerial photographs. To 
produce regularly spaced data, all methods for interpolation assume that no 
discontinuity exists in the recorded data. It is also assumed that any inter- 
mediate, estimated value is dependent on neighbouring recorded values. The 
simplest interpolation technique is linear interpolation. With reference to Figure
15, if $y_1$ and $y_2$ are observed values at points $x_1$ and $x_2$, then the value of y'
situated at x' between $x_1$ and $x_2$ can be calculated from


$$y' = y_1 + \frac{(y_2 - y_1)(x' - x_1)}{(x_2 - x_1)} \qquad (34)$$
Table 3 Concentration of magnesium (mg kg⁻¹) from a stream sampled at
different locations along its course. Distances are from stream mouth to
sample locations

Distance (m)    Mg (mg kg⁻¹)
    1800             4.0
    2700            10.1
    4500            11.5
    5200            10.2
    7100             8.4
    8500             8.6

For a value of x' of 2500 m the estimated magnesium concentration, y', is
8.74 mg kg⁻¹.
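The calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `linear_interpolate` and the list names are hypothetical, and the data are the six pairs of Table 3.

```python
from bisect import bisect_right

# Magnesium data from Table 3: distance from stream mouth (m), Mg (mg/kg).
DISTANCE_M = [1800, 2700, 4500, 5200, 7100, 8500]
MG_MG_PER_KG = [4.0, 10.1, 11.5, 10.2, 8.4, 8.6]


def linear_interpolate(xs, ys, x_new):
    """Estimate y at x_new from the two recorded points that bracket it."""
    if not xs[0] <= x_new <= xs[-1]:
        raise ValueError("x_new lies outside the recorded range")
    # Index of the left-hand neighbour; clamped so x_new == xs[-1] still works.
    i = min(bisect_right(xs, x_new) - 1, len(xs) - 2)
    slope = (ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
    return ys[i] + slope * (x_new - xs[i])


print(round(linear_interpolate(DISTANCE_M, MG_MG_PER_KG, 2500), 2))  # 8.74
```

The range check reflects the fact that the method interpolates only; it makes no attempt to extrapolate beyond the first or last sample.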

The difference between values of adjacent points is assumed to be a linear
function of the distance separating them. The closer a point is to an observa- 
tion, the closer its value is to that of the observation. Despite the simplicity of 
the calculation, linear interpolation should be used with care as the abrupt 
changes in slope that may occur at recorded values are unlikely to reflect 
accurately the more smooth transitions likely to be observed in practice. A 
better, and graphically more acceptable, result is achieved by fitting a smooth 
curve to the data. Suitable polynomials offer an excellent choice. 

Polynomial interpolation is simply an extension of the linear method. The 
polynomial is formed by adding extra terms to the model to represent curved 
regions of the spectrum and using extra data values in the model. 

If only one pair of measurements had been made, say ($y_1$, $x_1$), then a zeroth-order
equation of the type $y' = y_1$, for all $x'$, would be the only possible solution.
With two pairs of measurements, ($y_1$, $x_1$) and ($y_2$, $x_2$), a first-order linear
model can be proposed,

$$ y' = y_1 + a_1(x' - x_1) \tag{35} $$
where

$$ a_1 = \frac{y_2 - y_1}{x_2 - x_1} \tag{36} $$

which is $6.77 \times 10^{-3}$ for the magnesium data.

This, of course, is the model of linear interpolation and for x' = 2500 m,
y' = 8.74 mg kg⁻¹.

Figure 15 Magnesium concentration as a function of distance from the stream source and
the application of linear interpolation

To take account of more measured data, higher order polynomials can be
employed. A quadratic model will fit three pairs of points,

$$ y' = y_1 + a_1(x' - x_1) + a_2(x' - x_1)(x' - x_2) \tag{37} $$

with the quadratic term being zero when $x' = x_1$ or $x' = x_2$. When $x' = x_3$ then
substitution and rearrangement of Equation (37) allows the coefficient $a_2$ to be
calculated,

$$ a_2 = \frac{(y_3 - y_1) - a_1(x_3 - x_1)}{(x_3 - x_1)(x_3 - x_2)} \tag{38} $$

and

$$ a_2 = \frac{\dfrac{y_3 - y_1}{x_3 - x_1} - \dfrac{y_2 - y_1}{x_2 - x_1}}{x_3 - x_2} \tag{39} $$

which is $-2.2 \times 10^{-6}$ for the magnesium data.

Substituting for $a_2$ and x' = 2500 m into Equation (37), the estimated value
of y' is 9.05 mg kg⁻¹ Mg.

The technique can be extended further. With four pairs of observations, a
cubic equation can be generated to pass through each point,

$$ y' = y_1 + a_1(x' - x_1) + a_2(x' - x_1)(x' - x_2) + a_3(x' - x_1)(x' - x_2)(x' - x_3) \tag{40} $$

and by a similar process, at $x_4$ the coefficient $a_3$ is given by

$$ a_3 = \frac{ \dfrac{ \dfrac{y_4 - y_1}{x_4 - x_1} - \dfrac{y_2 - y_1}{x_2 - x_1} }{x_4 - x_2} - \dfrac{ \dfrac{y_3 - y_1}{x_3 - x_1} - \dfrac{y_2 - y_1}{x_2 - x_1} }{x_3 - x_2} }{x_4 - x_3} \tag{41} $$

which is $3.44 \times 10^{-10}$ for the magnesium data, and substituting into Equation
(40), for x' = 2500 m, then y' = 9.15 mg kg⁻¹ Mg.

As the number of observed points to be connected increases, then so too does 
the degree of the polynomial required if we are to guarantee passing through 
each point. The general technique is referred to as providing divided difference
polynomials. The coefficients a2, a3, a4, etc. may be generated algorithmically
by the ‘Newton forward formula’, and many examples of the algorithms are
available.¹¹,¹²
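As a hedged illustration of how the coefficients can be generated algorithmically, the sketch below builds a divided-difference table in place and evaluates the resulting polynomial by nested multiplication. The function names `newton_coefficients` and `newton_evaluate` are hypothetical and are not taken from the cited texts; the data are the first four pairs of Table 3.

```python
# First four magnesium observations from Table 3 (m, mg/kg).
DISTANCE_M = [1800, 2700, 4500, 5200]
MG_MG_PER_KG = [4.0, 10.1, 11.5, 10.2]


def newton_coefficients(xs, ys):
    """Return the divided-difference coefficients [y1, a1, a2, ...]."""
    coeffs = list(ys)
    n = len(xs)
    for order in range(1, n):
        # Work backwards so the lower-order differences are still available.
        for i in range(n - 1, order - 1, -1):
            coeffs[i] = (coeffs[i] - coeffs[i - 1]) / (xs[i] - xs[i - order])
    return coeffs


def newton_evaluate(xs, coeffs, x_new):
    """Evaluate the Newton-form polynomial by nested multiplication."""
    result = coeffs[-1]
    for i in range(len(coeffs) - 2, -1, -1):
        result = result * (x_new - xs[i]) + coeffs[i]
    return result


# Linear, quadratic, and cubic estimates at 2500 m from 2, 3, and 4 points.
for n_points in (2, 3, 4):
    xs, ys = DISTANCE_M[:n_points], MG_MG_PER_KG[:n_points]
    estimate = newton_evaluate(xs, newton_coefficients(xs, ys), 2500)
    print(n_points - 1, round(estimate, 3))
```

A convenient property of this form is that adding a further observation only appends one coefficient; the coefficients already computed are unchanged.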

To fit a curve to n data points a polynomial of degree (n — 1) is required, and 
with a large data set the number of coefficients to be calculated is correspond- 
ingly large. Thus 100 data points could be interpolated using a 99-degree 
polynomial. Polynomials of such a high degree, however, are unstable. They 
can fluctuate wildly with the high-degree terms forcing an exact fit to the data. 
Low-degree polynomials are much easier to work with analytically and they are 
widely used for curve fitting, modelling, and producing graphic output. To fit 
small polynomials to an extensive set of data it is necessary to abandon the idea 
of trying to force a single polynomial through all the points. Instead different 
polynomials are used to connect different segments of points, piecing each 
section smoothly together. One technique exploiting this principle is spline 
interpolation, and its use is analogous to using a mechanical flexicurve to draw 
manually a smooth curve through fixed points. 

The shape described by a spline between two adjacent points, or knots, is a 
cubic, third-degree polynomial. For the six pairs of data points representing 
our magnesium study, we would consider the curve connecting the data to 
comprise five cubic polynomials. Each of these takes the form

$$ s_i(x) = a_i x^3 + b_i x^2 + c_i x + d_i, \qquad i = 1 \ldots 5 \tag{42} $$


To compute the spline, we must calculate values for the 20 coefficients, four 
for each polynomial segment. Therefore we require 20 simultaneous equations, 
dictated by the following physical constraints imposed on the curve. 

Since the curve must touch each point then

$$ s_i(x_i) = y_i, \qquad s_i(x_{i+1}) = y_{i+1}, \qquad i = 1 \ldots 5 \tag{43} $$


The spline must curve smoothly about each point with no sharp bends or
kinks, so the slopes of adjacent segments must match where they connect. To
achieve this the first derivatives of the spline polynomials must be equal at the
measured points.

$$ \frac{ds_{i-1}(x_i)}{dx} = \frac{ds_i(x_i)}{dx}, \qquad i = 2 \ldots 5 \tag{44} $$

We can also demand that the second derivatives of adjacent segments are
equal at the knots.

$$ \frac{d^2 s_{i-1}(x_i)}{dx^2} = \frac{d^2 s_i(x_i)}{dx^2}, \qquad i = 2 \ldots 5 \tag{45} $$

11 A.F. Carley and P.H. Morgan, ‘Computational Methods in the Chemical Sciences’, Ellis
Horwood, Chichester, UK, 1989.
12 P. Gans, ‘Data Fitting in the Chemical Sciences’, J. Wiley and Sons, Chichester, UK, 1992.


Finally, we can specify that at the extreme ends of the curve the second
derivatives are zero:

$$ \frac{d^2 s_1(x_1)}{dx^2} = 0, \qquad \frac{d^2 s_5(x_6)}{dx^2} = 0 \tag{46} $$


From Equations (43) to (46) we can derive our 20 simultaneous equations
and, by suitable rearrangement and substitution of values for x and y, determine
the values of the 20 coefficients $a_i$, $b_i$, $c_i$ and $d_i$, i = 1 … 5.

This calculation is obviously laborious and the same spline can be computed
more efficiently by suitable scaling and substitution in the equations.¹¹ If the
value of the second derivative of the spline at $x_i$ is represented by $p_i$,

$$ p_i = \frac{d^2 s(x_i)}{dx^2}, \qquad i = 1 \ldots 6 \tag{47} $$

then if the values of $p_1 \ldots p_6$ were known, all the coefficients, a, b, c, d, could be
computed from the following four equations,

$$ s_i(x_i) = y_i, \qquad s_i(x_{i+1}) = y_{i+1}, \qquad \frac{d^2 s_i(x_i)}{dx^2} = p_i, \qquad \frac{d^2 s_i(x_{i+1})}{dx^2} = p_{i+1} \tag{48} $$

If each spline segment is scaled on the x-axis between the limits [0,1], using
the term $t = (x - x_i)/(x_{i+1} - x_i)$, then the curve can be expressed as¹¹

$$ s_i(t) = t\,y_{i+1} + (1 - t)\,y_i + \frac{(x_{i+1} - x_i)^2}{6}\left\{ (t^3 - t)\,p_{i+1} + \left[ (1 - t)^3 - (1 - t) \right] p_i \right\} \tag{49} $$


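To calculate the values of $p_i$ we impose the constraint that the first derivatives
of the spline segments are equal at their endpoints. The resulting equations
are

$$ \begin{aligned} v_2 p_2 + u_2 p_3 &= w_2 \\ u_2 p_2 + v_3 p_3 + u_3 p_4 &= w_3 \\ u_3 p_3 + v_4 p_4 + u_4 p_5 &= w_4 \\ u_4 p_4 + v_5 p_5 &= w_5 \end{aligned} \tag{50} $$

or in matrix form,

$$ \begin{bmatrix} v_2 & u_2 & 0 & 0 \\ u_2 & v_3 & u_3 & 0 \\ 0 & u_3 & v_4 & u_4 \\ 0 & 0 & u_4 & v_5 \end{bmatrix} \begin{bmatrix} p_2 \\ p_3 \\ p_4 \\ p_5 \end{bmatrix} = \begin{bmatrix} w_2 \\ w_3 \\ w_4 \\ w_5 \end{bmatrix} \tag{51} $$

where

$$ u_i = x_{i+1} - x_i, \qquad v_i = 2(x_{i+1} - x_{i-1}), \qquad w_i = 6\left[ \frac{y_{i+1} - y_i}{x_{i+1} - x_i} - \frac{y_i - y_{i-1}}{x_i - x_{i-1}} \right] \tag{52} $$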

Equation (51) can be solved for $p_i$ by conventional elimination methods.

Once the $p_i$ values have been computed, the value of t for any segment can be
calculated. From this the spline, s(x), can be determined using Equation (49)
and the appropriate values for $p_i$ and $p_{i+1}$.

For the magnesium in river water data, after scaling the distance data to km,
we have,

$$ \begin{bmatrix} 5.4 & 1.8 & 0.0 & 0.0 \\ 1.8 & 5.0 & 0.7 & 0.0 \\ 0.0 & 0.7 & 5.2 & 1.9 \\ 0.0 & 0.0 & 1.9 & 6.6 \end{bmatrix} \begin{bmatrix} p_2 \\ p_3 \\ p_4 \\ p_5 \end{bmatrix} = \begin{bmatrix} -36 \\ -16 \\ 5.5 \\ 6.5 \end{bmatrix} \tag{53} $$

with the result $p_2 = -6.314$, $p_3 = -1.058$, $p_4 = 0.939$, and $p_5 = 0.714$.
To estimate the magnesium concentration at a distance x' = 2.5 km, a value
of t is calculated,

$$ t = (x' - x_1)/(x_2 - x_1) = 0.778 \tag{54} $$

and this, with values for $p_1 = 0$ and $p_2 = -6.314$, is substituted into Equation
(49),

$$ s_1(t) = t\,y_2 + (1 - t)\,y_1 + (x_2 - x_1)^2 (t^3 - t)\,p_2/6 = 9.008 \text{ mg kg}^{-1} \text{ Mg} \tag{55} $$

Figure 16 The result of applying a cubic spline interpolation model to the stream magnesium
data

The resultant cubic spline curve for the complete range of the magnesium 
data is illustrated in Figure 16. 
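The procedure of Equations (49) to (52) can be sketched as follows. This is a minimal illustration, not the implementation of the cited texts: the function names are hypothetical, the tridiagonal system is solved by straightforward forward elimination and back substitution, and the right-hand sides are used unrounded, so the computed second derivatives (for example p2 of about -6.33) differ slightly from the values quoted above, which correspond to right-hand sides rounded to two figures.

```python
from bisect import bisect_right

# Table 3 data with distances scaled to km.
DISTANCE_KM = [1.8, 2.7, 4.5, 5.2, 7.1, 8.5]
MG_MG_PER_KG = [4.0, 10.1, 11.5, 10.2, 8.4, 8.6]


def natural_spline_curvatures(xs, ys):
    """Solve the tridiagonal system of Equation (51); returns p at every knot."""
    n = len(xs)
    u = [xs[i + 1] - xs[i] for i in range(n - 1)]            # knot spacings
    slope = [(ys[i + 1] - ys[i]) / u[i] for i in range(n - 1)]
    # One equation per interior knot; the two end values of p are zero.
    diag = [2.0 * (xs[i + 1] - xs[i - 1]) for i in range(1, n - 1)]
    rhs = [6.0 * (slope[i] - slope[i - 1]) for i in range(1, n - 1)]
    m = len(diag)
    for k in range(1, m):                                    # forward elimination
        factor = u[k] / diag[k - 1]
        diag[k] -= factor * u[k]
        rhs[k] -= factor * rhs[k - 1]
    inner = [0.0] * m
    inner[-1] = rhs[-1] / diag[-1]
    for k in range(m - 2, -1, -1):                           # back substitution
        inner[k] = (rhs[k] - u[k + 1] * inner[k + 1]) / diag[k]
    return [0.0] + inner + [0.0]


def spline_evaluate(xs, ys, p, x_new):
    """Evaluate Equation (49) on the segment that contains x_new."""
    if not xs[0] <= x_new <= xs[-1]:
        raise ValueError("x_new lies outside the recorded range")
    i = min(bisect_right(xs, x_new) - 1, len(xs) - 2)
    h = xs[i + 1] - xs[i]
    t = (x_new - xs[i]) / h
    cubic = (t**3 - t) * p[i + 1] + ((1 - t)**3 - (1 - t)) * p[i]
    return t * ys[i + 1] + (1 - t) * ys[i] + h * h * cubic / 6.0


p = natural_spline_curvatures(DISTANCE_KM, MG_MG_PER_KG)
print(round(spline_evaluate(DISTANCE_KM, MG_MG_PER_KG, p, 2.5), 2))  # 9.01
```

Only four unknowns are solved for, rather than the 20 coefficients of the direct formulation, which is the practical advantage of working with the second derivatives.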

Spline curve fitting has many important applications in analytical science, 
not only in interpolation but also in differentiation and calibration. The 
technique is particularly useful when no analytical model of the data is available.¹²

Having acquired our chemical data, it is now necessary to analyse the results 
and extract the required relevant information. This will obviously depend on 
the aims of the analysis, but further preprocessing and manipulation of the data 
may be needed. This is considered in the next chapter. 


CHAPTER 3


Feature Selection and Extraction 


1 Introduction 


Previous chapters have largely been concerned with processes related to acquir- 
ing our analytical data in a digital form suitable for further manipulation and 
analysis. This data analysis may include calibration, modelling, and pattern 
recognition. Many of these procedures are based on multivariate numerical 
data processing and before the methods can be successfully applied it is usual to 
perform some pre-processing on the data. There are three main aims of this 
pre-processing stage in data analysis, 


(a) to reduce the amount of data and eliminate data that are irrelevant to the 
study being undertaken, 

(b) to preserve or enhance sufficient information within the data in order to 
achieve the desired goal, 

(c) to extract the information in, or transform the data to, a form suitable for
further analysis. 


One of the most common forms of pre-processing spectral data is normali- 
zation. At its simplest this may involve no more than scaling each spectrum in a 
collection so that the most intense band in each spectrum is some constant 
value. Alternatively, spectra could be normalized to constant area under the 
curve of the absorption or emission profile. A more sophisticated procedure 
involves constructing a covariance matrix between variates and extracting the 
eigenvectors and eigenvalues. Eigen analysis yields a set of new variables which 
are linear combinations of the original variables. This can often lead to 
representing the original information in fewer new variables, thus reducing the 
dimensionality of the data and aiding subsequent analysis. 
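The two simple normalization schemes mentioned above can be sketched as follows. The function names are hypothetical, the spectrum is assumed to be a list of intensities at evenly spaced points, and the area is estimated with the trapezoid rule, which is one reasonable choice among several.

```python
def normalize_to_max(spectrum, target=1.0):
    """Scale a spectrum so that its most intense band equals `target`."""
    peak = max(spectrum)
    if peak == 0:
        raise ValueError("spectrum has no positive intensity")
    return [target * value / peak for value in spectrum]


def normalize_to_area(spectrum, spacing=1.0, target=1.0):
    """Scale a spectrum so that the area under the profile equals `target`."""
    # Trapezoid rule for evenly spaced points: end points carry half weight.
    area = spacing * (sum(spectrum) - 0.5 * (spectrum[0] + spectrum[-1]))
    if area == 0:
        raise ValueError("spectrum has zero area")
    return [target * value / area for value in spectrum]


print(normalize_to_max([1.0, 4.0, 2.0]))   # [0.25, 1.0, 0.5]
print(normalize_to_area([0.0, 2.0, 0.0]))  # [0.0, 1.0, 0.0]
```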

The success of pattern recognition techniques can frequently be enhanced or 
simplified by suitable prior treatment of the analytical data, and feature selec- 
tion and feature extraction are important stages in chemometrics. Feature 
selection refers to identifying and selecting those features present in the analytical
data which are believed to be important to the success of calibration or
pattern recognition. Techniques commonly used include differentiation, integration,
and peak identification. Feature extraction, on the other hand, changes
the dimensionality of the data and generally refers to processes combining or
transforming original variables to provide new and better variables. Methods
widely used include Fourier transformation and principal components analysis.
In this chapter the popular techniques pertinent to feature selection and 
extraction are introduced and developed. Their application is illustrated with 
reference to spectrochemical analysis. 


2 Differentiation 


Derivative spectroscopy provides a means for presenting spectral data in a 
potentially more useful form than the zero'th order, normal data. The tech- 
nique has been used for many years in many branches of analytical spectro- 
scopy. Derivative spectra are usually obtained by differentiating the recorded 
signal with respect to wavelength as the spectrum is scanned. Whereas early 
applications mainly relied on hard-wired units for electronic differentiation, 
modern derivative spectroscopy is normally accomplished computationally 
using mathematical functions. First-, second-, and higher-order derivatives can 
easily be generated. 

Figure 1 A pair of overlapping Gaussian peaks (a), and the first- (b), second- (c), and
third-order (d) derivative spectra

Figure 2 Quantitative analysis with first derivative spectra. Peak heights are displayed in
relative absorbance units

Analytical applications of derivative spectroscopy are numerous and generally
owe their popularity to the apparent higher resolution of the differential
data compared with the original spectrum. The effect can be illustrated with
reference to the example shown in Figure 1. The zero'th-, first-, and second- 
order derivatives of a spectrum, comprised of the sum of two overlapping 
Gaussian peaks, are presented. The presence of a smaller analyte peak can be 
much more evident in the derivative spectra. In addition, for determining the 
intensity of the smaller peak in the presence of the large neighbouring peak, 
derivative spectra can be more useful and may be subject to less error. This is 
illustrated in Figure 2, in which the zero'th and first derivative spectra are 
shown for an analyte band with and without the presence of an overlapping 
band. If we assign unit peak height to the analyte in the normal, zero'th-order, 
spectrum, then for the same band with the interfering band present, a peak 
height of 55 units is recorded. Using a tangent baseline in order to attempt to 
correct for the overlap fails as there is no unique or easily identified tangent, 
and a not unreasonable value of 1.2 units for the peak height could be recorded,
a 20% error. The situation is improved considerably if the first derivative 
spectrum is analysed. A value of one is assigned to the peak-to-peak distance of 
the lone analyte spectrum. In the presence of the overlapping band a similar 
measure for the analyte is now 1.04, a 4% error. 

This example, however, oversimplifies the case of using derivative spectro- 
scopy as it gives no indication of the effects of noise on the results. Derivative 
spectra tend to emphasize changes in slope that are difficult to detect in the 
zero'th-order spectrum. Unfortunately, as we have seen in previous chapters, 
noise is often comprised of high-frequency components and thus may be greatly 
amplified by differentiation. It is the presence of noise which generally limits the
use of derivative spectroscopy to UV-visible spectrometry and other techniques 
in which a high signal-to-noise ratio may be obtained for a spectrum. 

Various mathematical procedures may be employed to differentiate spectral 
data. We will assume that such data are recorded at evenly spaced intervals 
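along the wavelength, λ, or other x-axis. If this is not the case, the data may be
interpolated to provide this. The simplest method to produce the first-derivative
spectrum is by difference,

$$ \frac{dy}{d\lambda} = \frac{y_{i+1} - y_i}{\Delta\lambda} \tag{1} $$

or,

$$ \frac{dy}{d\lambda} = \frac{y_{i+1} - y_{i-1}}{2\Delta\lambda} \tag{2} $$

and for the second derivative,

$$ \frac{d^2y}{d\lambda^2} = \frac{y_{i+1} - 2y_i + y_{i-1}}{\Delta\lambda^2} \tag{3} $$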

where y represents the spectral intensity, the absorbance, or other metric. 

Various other methods have been proposed to compute derivatives, including
the use of suitable polynomial derivatives as suggested by Savitzky and
Golay.¹,² The use of a suitable array of weighting coefficients as a smoothing
function with which to convolute spectral data was described in Chapter 2. In a
similar manner, an array can be specified which on convolution produces the
first, or higher degree differential spectrum. Using a quadratic polynomial and
a five-point moving window, the first derivative is given by

$$ \frac{dy}{d\lambda} = \frac{1}{10\Delta\lambda}\left( -2y_{i-2} - y_{i-1} + y_{i+1} + 2y_{i+2} \right) \tag{4} $$

and for the second derivative,

$$ \frac{d^2y}{d\lambda^2} = \frac{1}{7\Delta\lambda^2}\left( 2y_{i-2} - y_{i-1} - 2y_i - y_{i+1} + 2y_{i+2} \right) \tag{5} $$


Equations (4) and (5) are similar to Equations (2) and (3). The difference is in 
the use of additional terms using extra points from the data in order to provide 
a better approximation. 

1 A. Savitzky and M.J.E. Golay, Anal. Chem., 1964, 36, 1627.
2 P. Gans, ‘Data Fitting in the Chemical Sciences’, J. Wiley and Sons, Chichester, UK, 1992.

Table 1 Derivative model data, y = x + x²/2 + noise

x                 0        1        2        3        4

Noise             0.200    0.000    0.200   -0.200    0.200
y                 0.000    1.500    4.000    7.500   12.000
y + Noise         0.200    1.500    4.200    7.300   12.200    Data 1
y + (Noise/2)     0.100    1.500    4.100    7.400   12.100    Data 2
y + (Noise/4)     0.050    1.500    4.050    7.450   12.050    Data 3
y + (Noise/8)     0.025    1.500    4.025    7.475   12.025    Data 4

Table 2 Derivatives of y = x + x²/2 + noise (from Table 1) determined by
difference formulae

                             Data 1        Data 2        Data 3         Data 4
dy/dλ by Equation (1)        3.1 or 2.7    3.3 or 2.6    3.4 or 2.55    3.45 or 2.525
dy/dλ by Equation (2)        2.9           2.95          2.98           2.9875
dy/dλ by Equation (4)        2.980         2.990         2.995          2.9975
d²y/dλ² by Equation (3)      0.4           0.7           0.85           0.925
d²y/dλ² by Equation (5)      1.0857        1.0429        1.0214         1.0107

The relative merits of these different methods can be compared by differentiating
a known mathematical function. The model we will use is y = x + x²/2
at the point x = 2. Various levels of noise are imposed on the signal y, as shown
in Table 1. The resulting derivatives are shown in Table 2. As the noise level 
reduces and tends to zero, the derivative results from applying the five-point 
polynomial converge more quickly towards the correct noise-free value of 3 for 
the first derivative, and 1 for the second derivative. As with polynomial 
smoothing, the Savitzky-Golay differentiation technique is available with 
many commercial spectrometers. 
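The difference formulae and the five-point polynomial formulae are short enough to check directly. The sketch below, with hypothetical function names, applies Equations (2) to (5) at the point x = 2 of the noisiest data set (Data 1 of Table 1), where the spacing between points is one unit.

```python
def central_difference(y, i, spacing=1.0):
    """First derivative by Equation (2)."""
    return (y[i + 1] - y[i - 1]) / (2.0 * spacing)


def second_difference(y, i, spacing=1.0):
    """Second derivative by Equation (3)."""
    return (y[i + 1] - 2.0 * y[i] + y[i - 1]) / spacing**2


def polynomial_first_derivative(y, i, spacing=1.0):
    """First derivative by Equation (4): quadratic, five-point window."""
    total = -2.0 * y[i - 2] - y[i - 1] + y[i + 1] + 2.0 * y[i + 2]
    return total / (10.0 * spacing)


def polynomial_second_derivative(y, i, spacing=1.0):
    """Second derivative by Equation (5): quadratic, five-point window."""
    total = (2.0 * y[i - 2] - y[i - 1] - 2.0 * y[i]
             - y[i + 1] + 2.0 * y[i + 2])
    return total / (7.0 * spacing**2)


DATA_1 = [0.200, 1.500, 4.200, 7.300, 12.200]   # y + noise at x = 0 ... 4

print(round(central_difference(DATA_1, 2), 4))            # 2.9
print(round(polynomial_first_derivative(DATA_1, 2), 4))   # 2.98
print(round(second_difference(DATA_1, 2), 4))             # 0.4
print(round(polynomial_second_derivative(DATA_1, 2), 4))  # 1.0857
```

The noise-free values are 3 and 1, so in this example the five-point formulae are closer to the truth for both derivatives.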

Just as smoothing can be undertaken in the frequency domain (as discussed
in Chapter 2), so too can differentiation following Fourier transformation of
the amplitude-time spectrum. Obtaining a derivative in the Fourier domain is
quite simple and is achieved by multiplying the transform by a linear function.³
The effect is illustrated in Figure 3. The original spectrum, comprising two
overlapping bands and random noise, is first converted by Fourier transformation
to the frequency domain. The transformed data are multiplied by the
filter function shown in Figure 3(c), and the result transformed back to produce
the first-derivative spectrum, Figure 3(d). The susceptibility of differentiation
to high-frequency noise is clearly demonstrated in this example. The high-frequency
components present in the Fourier domain data are heavily weighted
by the filter function compared with low-frequency data, and the signal-to-noise
ratio of the differential spectrum is severely degraded. This problem can
be partly alleviated by combining a smoothing function along with the differential
filter. In Figure 4(a), the differential transform is truncated and applied to
low frequencies only. High frequencies are eliminated by the zero weighting of
the function. The result of multiplying our transformed data by this new filter
function is shown in Figure 4(b) and the resultant first-derivative spectrum in
Figure 4(c). The effect of the extra smoothing function is evident if Figure 4(c)
and Figure 3(d) are compared.

3 R. Bracewell, ‘The Fourier Transform and its Applications’, McGraw-Hill, New York, USA,
1965.

Figure 3 Differentiation of spectra via the Fourier transform: (a) the original spectrum;
(b) its Fourier transform; (c) the differential filter applied to the transform; (d)
the resulting first derivative spectrum from the inverse transform
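A minimal sketch of differentiation in the Fourier domain is given below. It is illustrative only: the function names and the `cutoff` parameter are hypothetical, the signal is assumed to be periodic and evenly sampled, and a naive discrete Fourier transform is used so that the example needs only the standard library (a fast transform would be used in practice).

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; adequate for a short demo."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[m] * cmath.exp(sign * 2j * math.pi * k * m / n)
                    for m in range(n))
        out.append(total / n if inverse else total)
    return out


def fourier_derivative(y, spacing, cutoff=None):
    """First derivative of a periodic signal via the linear filter i*2*pi*f.

    Components whose frequency index exceeds `cutoff` are given zero weight,
    which corresponds to a truncated (smoothing) differential filter.
    """
    n = len(y)
    spectrum = dft(y)
    for k in range(n):
        if 2 * k == n:
            signed_k = 0          # Nyquist term has no well-defined sign
        elif k > n // 2:
            signed_k = k - n      # upper half holds the negative frequencies
        else:
            signed_k = k
        if cutoff is not None and abs(signed_k) > cutoff:
            spectrum[k] = 0.0
        else:
            spectrum[k] *= 2j * math.pi * signed_k / (n * spacing)
    return [value.real for value in dft(spectrum, inverse=True)]


# One period of a sine wave plus a small high-frequency disturbance.
N_POINTS = 32
SPACING = 1.0 / N_POINTS
signal = [math.sin(2 * math.pi * m * SPACING)
          + 0.1 * math.sin(2 * math.pi * 10 * m * SPACING)
          for m in range(N_POINTS)]
plain = fourier_derivative(signal, SPACING)
smoothed = fourier_derivative(signal, SPACING, cutoff=5)
print(round(plain[0], 3), round(smoothed[0], 3))
```

Without the cut-off the tenth harmonic is weighted ten times more heavily than the fundamental, so a disturbance of one-tenth the amplitude contributes as much to the derivative as the signal itself; with the cut-off it is removed entirely.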

For many applications the digitization of a full spectrum provides far more 
data than is warranted by the spectrum's information content. An infrared
spectrum, for example, is characterized as much by regions of no absorption as
regions containing absorption bands, and most IR spectra can be reduced to a
list of some 20 to 50 peaks. This represents such a dramatic decrease in dimen-
sionality of the spectral data that it is not surprising that peak tables are 
commonly employed to describe spectra. The determination of spectral peak 
positions from digital data is relatively straightforward and the facility is 
offered on many commercial spectrometers. Probably the most common tech- 
niques for finding peak positions involve analysis of derivative data. 

Figure 4 Combining smoothing and differentiating in the frequency domain: (a) the
truncation filter to remove high-frequency noise signals and provide the first
derivative; (b) the transform of the spectrum from Figure 3(a) after application
of the filter; (c) the resulting first derivative spectrum from the inverse transform

In Figure 5 a single Lorentzian function is illustrated along with its first,
second, third, and fourth derivatives with respect to energy. At peak positions
the following conditions exist,

$$ y' = 0, \qquad y'' < 0, \qquad y''' = 0, \qquad y'''' > 0 \tag{6} $$


where y' is the first derivative, y'' the second, and so on.
Thus, the presence and location of a peak in a spectrum can be ascertained
from a suitable subset of the rules expressed mathematically in Equation (6):⁴
Rule 1, a peak centre has been located if the first derivative value is zero and
the second derivative value is negative, i.e.

(y' = 0) AND (y'' < 0), or

Rule 2, a peak centre has been located if the third derivative is zero and the
fourth derivative is positive, i.e.

(y''' = 0) AND (y'''' > 0)

4 J.R. Morrey, Anal. Chem., 1968, 40, 905.

Figure 5 A Lorentzian band (a), and its first (b), second (c), third (d), and fourth (e)
derivatives


Whereas Rule 2 is influenced less by adjacent, overlapping bands than Rule
1, it is affected more by noise in the data. In practice some form of Rule 1 is
generally used. A peak-finding algorithm may take the following form:

Step 1: Convolute the spectrum with a suitable quadratic differentiating 
function until the computed central value changes sign. 

Step 2: At this point of inflection compute a cubic, least-squares function. By 
numerical differentiation of the resultant equation determine the true position 
of zero slope (the peak position). 

With any such algorithm it is necessary to specify some tolerance value below 
which any peaks are assumed to arise from noise in the data. The choice of 
window width for the quadratic differentiating function and the number of 
points about the observed inflection to fit the cubic model are selected by the 
user. These factors depend on the resolution of the recorded spectrum and the 
shape of the bands present. Results using a 15-point quadratic differentiating 
convolution function and a nine-point cubic fitting equation are illustrated in 
Figure 6. 
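A simplified form of such an algorithm, based on Rule 1, is sketched below. It is not the procedure used to produce Figure 6: it assumes a five-point rather than a 15-point quadratic differentiating function, replaces the cubic least-squares fit of Step 2 with linear interpolation between the two derivative values that bracket the sign change, and takes the x-axis to start at zero. The function names and the `min_height` tolerance are hypothetical.

```python
def find_peaks(y, spacing=1.0, min_height=0.0):
    """Locate peak centres where y' changes from positive to negative."""

    def first_derivative(i):      # five-point quadratic convolution
        total = -2.0 * y[i - 2] - y[i - 1] + y[i + 1] + 2.0 * y[i + 2]
        return total / (10.0 * spacing)

    def second_derivative(i):     # five-point quadratic convolution
        total = (2.0 * y[i - 2] - y[i - 1] - 2.0 * y[i]
                 - y[i + 1] + 2.0 * y[i + 2])
        return total / (7.0 * spacing**2)

    peaks = []
    for i in range(2, len(y) - 3):
        left, right = first_derivative(i), first_derivative(i + 1)
        if left > 0 >= right and second_derivative(i) < 0 and y[i] >= min_height:
            # Linear interpolation for the position of zero slope.
            fraction = left / (left - right)
            peaks.append((i + fraction) * spacing)
    return peaks


def lorentzian(x, centre, half_width):
    return 1.0 / (1.0 + ((x - centre) / half_width) ** 2)


spectrum = [lorentzian(x, 80.3, 5.0) for x in range(161)]
print([round(position, 1) for position in find_peaks(spectrum, min_height=0.05)])
```

The `min_height` argument plays the part of the tolerance value discussed below: any maximum whose intensity falls under it is attributed to noise and ignored.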
Figure 6 Results of a peak picking algorithm. At x = 80, the first derivative spectrum
crosses zero and the second derivative is negative. A 9-point cubic least-squares 
fit is applied about this point to derive the coefficients of the cubic model. The 
peak position (dy/dx = 0) is calculated as occurring at x = 80.3 


3 Integration 


Mathematically, integration is complementary to differentiation and comput- 
ing the integral of a function is a fundamental operation in data processing. It 
occurs frequently in analytical science in terms of determining the area under a 
curve, e.g. the integrated absorbance of a transient signal from a graphite 
furnace atomic absorption spectrometer. Many classic algorithms exist for 
approximating the area under a curve. We will briefly examine the more 
common with reference to the absorption profile illustrated in Figure 7. This 
envelope was generated from the model y = 0.1x³ − 1.1x² + 3x + 0.2. Its
integral, between the limits x = 0 and x = 6, can be computed directly. The area 
under the curve is 8.400. 

One of the simplest integration techniques to implement on a computer is the
method of summing rectangles that each fit a portion of the curve, Figure 8(a).
For N + 1 points in the interval $x_1, x_2 \ldots x_{N+1}$, we have N rectangles of width
$(x_{i+1} - x_i)$ and height, $h_{mi}$, given by the value of the curve at the mid-point
between $x_i$ and $x_{i+1}$. The approximate area under the curve, A, between $x_1$ and
$x_{N+1}$ is therefore given by

$$ A = \sum_{i=1}^{N} (x_{i+1} - x_i)\,h_{mi} \tag{7} $$

As N gets larger, the width of each rectangle becomes smaller and the answer
is more accurate:

for N = 5,  A = 8.544
    N = 10, A = 8.436
    N = 15, A = 8.416

Figure 7 The model absorption profile from a graphite furnace AAS study

Figure 8 The area under the AAS profile using (a) rectangular and (b) trapezoidal
integration
A second method of approximating the integral is to divide the area under
the curve into trapezoids, Figure 8(b). The area of each trapezoid is given by
one-half the product of the width $(x_{i+1} - x_i)$ and the sum of the two sides, $h_i$
and $h_{i+1}$. The area under the curve can be calculated from

$$ A = \sum_{i=1}^{N} (x_{i+1} - x_i)(h_i + h_{i+1})/2 \tag{8} $$


For our absorption peak, the trapezoid method using different widths produces
the following estimates for the integral:

for N = 5,  A = 8.112
    N = 10, A = 8.328
    N = 15, A = 8.368


In general the trapezoid method is inferior to the rectangular method.
A more accurate method can be achieved by combining the rectangular and
trapezoid methods into the technique referred to as Simpson's method,⁵

$$ A = \sum_{i=1}^{N} (x_{i+1} - x_i)(4h_{mi} + h_i + h_{i+1})/6 \tag{9} $$

For our absorption profile this gives

for N = 5,  A = 8.400
    N = 10, A = 8.400
    N = 15, A = 8.400
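The three rules of Equations (7) to (9) can be compared with a short sketch. The function names are hypothetical, and equal-width intervals are assumed for simplicity although the equations themselves allow unequal widths.

```python
def profile(x):
    """Model absorption profile, y = 0.1x^3 - 1.1x^2 + 3x + 0.2."""
    return 0.1 * x**3 - 1.1 * x**2 + 3.0 * x + 0.2


def rectangle_rule(f, lower, upper, n):
    """Equation (7): rectangles whose height is the mid-point value."""
    width = (upper - lower) / n
    return sum(width * f(lower + (i + 0.5) * width) for i in range(n))


def trapezoid_rule(f, lower, upper, n):
    """Equation (8): trapezoids between adjacent points."""
    width = (upper - lower) / n
    return sum(width * (f(lower + i * width) + f(lower + (i + 1) * width)) / 2.0
               for i in range(n))


def simpson_rule(f, lower, upper, n):
    """Equation (9): weighted combination of mid-point and end-point values."""
    width = (upper - lower) / n
    total = 0.0
    for i in range(n):
        left = lower + i * width
        total += width * (4.0 * f(left + 0.5 * width)
                          + f(left) + f(left + width)) / 6.0
    return total


for n in (5, 10, 15):
    print(n,
          round(rectangle_rule(profile, 0.0, 6.0, n), 3),
          round(trapezoid_rule(profile, 0.0, 6.0, n), 3),
          round(simpson_rule(profile, 0.0, 6.0, n), 3))
```

Simpson's method returns the exact area here for every N because the rule integrates any cubic polynomial exactly; for real absorption profiles some residual error is to be expected.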


4 Combining Variables 


Many analytical measures cannot be represented as a time-series in the form of 
a spectrum, but are comprised of discrete measurements, e.g. compositional or 
trace analysis. Data reduction can still play an important role in such cases. The 
interpretation of many multivariate problems can be simplified by considering 
not only the original variables but also linear combinations of them. That is, a
new set of variables can be constructed each of which contains a sum of the 
original variables each suitably weighted. These linear combinations can be 
derived on an ad hoc basis or more formally using established mathematical 
techniques. Whatever the method used, however, the aim is to reduce the 
number of variables considered in subsequent analysis and obtain an improved 
representation of the original data. The number of variables measured is not 
reduced. 

An important and commonly used procedure which generally satisfies these
criteria is principal components analysis. Before this specific topic is examined
it is worthwhile discussing some of the more general features associated with
linear combinations of variables.

5 A.F. Carley and P.H. Morgan, ‘Computational Methods in the Chemical Sciences’, Ellis
Horwood, Chichester, UK, 1989.


Linear Combinations of Variables 


In order to consider the effects and results of combining different measured 
variables the data set shown in Table 3 will be analysed. Table 3 lists the mean 
values, from 11 determinations, of the concentration of each of 13 trace metals 
from 17 different samples of heart tissue.⁶ The data in Table 3 indicate that the
trace metal composition of cardiac tissue derived from different anatomical 
sites varies widely. However, it is not immediately apparent by visual examin- 
ation of these raw data alone, what order, groups, or underlying patterns exist 
within the data. 

The correlation matrix for the 13 variables is shown in Table 4 and, as is 
usual in multivariate data, some pairs of variables are highly correlated. 
Consider, in the first instance, the concentrations of chromium and nickel. We 
shall label these variables X1 and X2. These elements exhibit a mutual correlation
of 0.90. A scatter plot of these data is illustrated in Figure 9. Also shown
are projections of the points on to the X1 and X2 concentration axes, providing
one-dimensional frequency distributions (as bar graphs) of the variables X1 and
X2. It is evident from Figure 9 that a projection of the data points onto some
other axis could provide this axis with a greater spread in terms of the frequency
distribution. This single new variable or axis would contain more variance or
potential information than either of the two original variables on their own.
For example, a new variable, X3, could be identified which can be defined as the
sum of the original variables, i.e.

$$ X3 = a \cdot X1 + b \cdot X2 \tag{10} $$

and its value for the 17 samples calculated. The values of a and b could be
chosen arbitrarily such that, for example, a = b. Then, this variable would
describe a new axis at an angle of 45° with the axes of Figure 9. The sample
points can be projected on to this as illustrated in Figure 10.

As for the actual values of the coefficients a and b, the simplest case is
described by a = b = 1, but any pair of equal values will provide the same angle
of projection and the same form of the distribution of data on this new line. In
practice, it is usual to specify a particular linear combination referred to as the
normalized linear combination and defined by

$$ a^2 + b^2 = 1 \tag{11} $$

Normalization of the coefficients defining our new variable scales it to the
range of values used to define the X1 and X2 axes of the original graph. In our
example, this implies $a = b = 1/\sqrt{2}$. The variance of X3 derived from substituting
a and b into Equation (10) for the concentration of chromium and nickel for
each of the 17 samples is 5.22 compared with σ² = 3.07 and σ² = 2.43 for X1 and
X2 respectively. Thus X3 does indeed contain more potential information than
either X1 or X2.

6 W. Niedermeier, in ‘Applied Atomic Spectroscopy’, ed. E.L. Grove, Plenum Press, New York,
USA, 1978, p. 219.

Figure 9 Chromium and nickel concentration scatter plot from heart tissue data. The
distribution of concentration values for each element is shown as a bar graph on
their respective axes

Figure 10 A 45° line on the Cr-Ni data plot with the individual sample points projected on
to this line

This reorganization or partitioning of variance associated with individual 
variates can be formally addressed as follows. 

For any linear combination of variables defining a new variable X given by

$$ X = a_1 x_1 + a_2 x_2 + \ldots + a_n x_n \tag{12} $$

the variance, $s_X^2$, of the new variable can be calculated from

$$ s_X^2 = \sum_{j=1}^{n} \sum_{k=1}^{n} a_j a_k \,\mathrm{Cov}_{jk} \tag{13} $$

which, from the definition of covariance, can be rewritten as

$$ s_X^2 = \sum_{j=1}^{n} a_j^2 s_j^2 + \sum_{j=1}^{n} \sum_{k \neq j} a_j a_k r_{jk} s_j s_k \tag{14} $$

where $r_{jk}$ is the correlation coefficient between variables $x_j$ and $x_k$.

It should be noted that for statistically independent variables, rj, = 0 and 
Equation (14) reduces to the more common equation stating that the variance 
of a sum of variables is equal to the sum of the variances for each variable. 
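The variance of the chromium and nickel combination can be checked from the summary statistics alone. In the sketch below the function name is hypothetical, and the inputs are the rounded figures quoted in the text (variances of 3.07 and 2.43, and a correlation of 0.90), so the result of about 5.21 agrees with the quoted 5.22 only to within that rounding.

```python
import math


def combination_variance(coefficients, std_devs, correlations):
    """Variance of a linear combination from standard deviations and correlations."""
    n = len(coefficients)
    total = 0.0
    for j in range(n):
        for k in range(n):
            r = 1.0 if j == k else correlations[j][k]
            total += (coefficients[j] * coefficients[k]
                      * r * std_devs[j] * std_devs[k])
    return total


a = b = 1.0 / math.sqrt(2.0)                       # normalized, a^2 + b^2 = 1
std_devs = [math.sqrt(3.07), math.sqrt(2.43)]      # chromium, nickel
correlated = [[1.0, 0.90], [0.90, 1.0]]
independent = [[1.0, 0.0], [0.0, 1.0]]

print(round(combination_variance([a, b], std_devs, correlated), 2))   # 5.21
print(round(combination_variance([a, b], std_devs, independent), 2))  # 2.75
```

The second line shows the independent case: with zero correlation the cross terms vanish and the result is simply the weighted sum of the two variances.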

The calculated value for the variance of our new variable X3 confirms that 
there is an increased spread of the data on the new axis. As well as this 
algegbraic notation, it is worth pointing out that the coefficients of the normal- 
ized linear combination may be represented by the trigonometric identities 


а = cosa 
b = sina (15) 


where а is the angle between the projection of the new axis and the original 
ordinate axis. If a = 45°, then а=} = 1/ V2, the normalized coefficients as 
derived from Equation (11). This trigonometric relationship is often employed 
in determining different linear combinations of variables and is used in many 
principal component algorithms. 
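As a rough illustration of how the angle in Equation (15) can be used, the sketch below scans the angle in small steps and reports the direction along which the variance of the combined variable is largest. It assumes that a multiplies X1 and b multiplies X2, and it uses the Cr-Ni variances and covariance quoted in this section; the grid search and the function name `variance_at_angle` are illustrative choices, not a method prescribed by the text.

```python
import math

def variance_at_angle(alpha_deg, s1_sq, s2_sq, cov12):
    """Variance of a*X1 + b*X2 with a = cos(alpha) and b = sin(alpha)."""
    alpha = math.radians(alpha_deg)
    a, b = math.cos(alpha), math.sin(alpha)
    return a * a * s1_sq + b * b * s2_sq + 2 * a * b * cov12

S1_SQ, S2_SQ, COV12 = 3.07, 2.43, 2.47  # Cr-Ni values quoted in the text

var_45 = variance_at_angle(45.0, S1_SQ, S2_SQ, COV12)

# Scan 0-180 degrees in 0.1 degree steps for the direction of largest variance.
angles = [i / 10 for i in range(1800)]
best_angle = max(angles, key=lambda ang: variance_at_angle(ang, S1_SQ, S2_SQ, COV12))
best_var = variance_at_angle(best_angle, S1_SQ, S2_SQ, COV12)

print(round(var_45, 2), best_angle, round(best_var, 2))
```

The 45° line is close to, but not exactly, the direction of maximum variance for these data; the optimum lies a few degrees away and carries slightly more variance.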

Values of $\alpha$, or a and b, employed in practice depend on the aims of the data
analysis. Different linear combinations of the same variables will produce new
variables with different attributes which may be of interest in studying different
problems. The linear combination which produces the greatest separation
between two groups of data samples is appropriate in supervised pattern
recognition. This forms the basis of linear discriminant analysis, a topic that
will be discussed in Chapter 5. Considering our samples or objects as a single
group or cluster, we may wish to determine the minimum number of normalized
linear combinations having the greatest proportion of the total variance, in
order to reduce the dimensionality of the problem. This is the task of principal
components analysis and is treated in the next section.

7 B. Flury and H. Riedwyl, 'Multivariate Statistics: A Practical Approach', Chapman and Hall,
London, UK, 1984.


Principal Components Analysis 


The aims of performing a principal components analysis (PCA) on multivariate 
data are basically two-fold. Firstly, PCA involves rotating and transforming
the original n axes, each representing an original variable, into new axes. This
transformation is performed in a way so that the new axes lie along the 
directions of maximum variance of the data with the constraint that the axes are 
orthogonal, i.e. the new variables are uncorrelated. It is usually the case that the 
number of new variables, p, needed to describe most of the sample data 
variance is less than n. Thus PCA affords a method and a technique to reduce 
the dimensionality of the parameter space. Secondly, PCA can reveal those 
variables, or combinations of variables, that determine some inherent structure 
in the data and these may be interpreted in chemical or physico-chemical terms. 

As in the previous section, we are interested in linear combinations of 
variables, with the goal of determining that combination which best summa- 
rizes the n-dimensional distribution of data. We are seeking the linear combin- 
ation with the largest variance, with normalized coefficients applied to the 
variables used in the linear combinations. This axis is the so-called first principal 
axis or first principal component. Once this is determined, then the search 
proceeds to find a second normalized linear combination that has most of the 
remaining variance and is uncorrelated with the first principal component. The 
procedure is continued, usually until all the principal components have been
calculated. In this case, p = n and a selected subset of the principal components
is then used for further analysis and for interpretation. 

Before proceeding to examine how principal components are calculated, it is 
worthwhile considering further a graphical interpretation of their structure and 
characteristics. From our heart tissue, trace metal data, the variance of 
chromium concentration is 3.07, the variance of nickel concentration is 2.43, 
and their covariance is 2.47. This variance-covariance structure is represented 
by the variance-covariance matrix,

$$\mathbf{Cov}_{X1,X2} = \begin{pmatrix} 3.07 & 2.47 \\ 2.47 & 2.43 \end{pmatrix} \tag{16}$$

in which the rows and columns are ordered X1, X2.


As well as this matrix form, this structure can also be represented diagrammatically,
as shown in Figure 11. The variance of the chromium data is
represented by a line along the X1 axis with a length equal to the variance of X1.
Since the concentration of chromium is correlated with the concentration of
nickel, X1 values also vary along the X2 axis, and this is represented by a second
line drawn from the end of the first, parallel to the X2 axis. The length of this line is equal to
the covariance of X1 with X2 and represents the degree of interaction or
collinearity between the variables. In a similar manner, the variance and
covariance of X2 are drawn along and from the second axis. For a square
(2 × 2) matrix, these elements of the variance-covariance matrix lie on the
boundary of an ellipse, the centre of which is the origin of the co-ordinate
system. The slope of the major axis is the eigenvector associated with the first
principal component, and its corresponding eigenvalue is the length of this
major axis, Figure 12. The second principal component is defined by the second
eigenvector and eigenvalue. It is represented by the minor axis of the ellipse and
is orthogonal, 90°, to the first principal component. For a 3 × 3 variance-covariance
matrix the elements lie on the surface of a three-dimensional
ellipsoid. For larger matrices still, higher-dimensional elliptical shapes apply
and can only be imagined. Fortunately the mathematical operations deriving
and defining these components remain the same whatever the dimensionality.

Figure 11 A bivariate variance-covariance matrix may be displayed graphically

8 J.C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley and Sons, New York, USA, 1973.

Thus, principal components can be defined as the eigenvectors of a variance-covariance
matrix. They provide the direction of new axes (new variables) on to
which data can be projected. The size, or length, of these new axes containing
our projected data is proportional to the variance of the new variable.

How do we calculate these eigenvectors and eigenvalues? In practice the
calculations are always performed on a computer and there are many algorithms
published in mathematical and chemometric texts. For our purposes, in
order to illustrate their derivation, we will limit ourselves to bivariate data and
calculate the eigenvectors manually. The procedure adopted largely follows
that of Davis⁸ and Healy.⁹

Figure 12 The elements of a bivariate variance-covariance matrix lie on the boundary
defined by an ellipse. The major axis of the ellipse represents the first principal
component, and its minor axis the second principal component

9 M.J.R. Healy, 'Matrices for Statistics', Oxford University Press, Oxford, UK, 1986.

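Consider a set of simultaneous equations, expressed in matrix notation,

$$\mathbf{A}\mathbf{x} = \lambda\mathbf{x} \tag{17}$$

which simply states that matrix $\mathbf{A}$, multiplied by vector $\mathbf{x}$, is equal to some
constant, the eigenvalue $\lambda$, multiplied by $\mathbf{x}$. To determine these eigenvalues,
Equation (17) can be rewritten as

$$\mathbf{A}\mathbf{x} - \lambda\mathbf{x} = 0 \tag{18}$$

or

$$(\mathbf{A} - \lambda\mathbf{I})\mathbf{x} = 0 \tag{19}$$

where $\mathbf{I}$ is the identity matrix, which for a 2 × 2 matrix is

$$\mathbf{I} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \tag{20}$$

If $\mathbf{x}$ is not 0 then the determinant of the coefficient matrix must be zero, i.e.

$$|\mathbf{A} - \lambda\mathbf{I}| = 0 \tag{21}$$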


For our experimental data with X1 and X2 representing chromium and nickel
concentrations, and $\mathbf{A} = \mathbf{Cov}$, then

$$|\mathbf{Cov} - \lambda\mathbf{I}| = \begin{vmatrix} s_1^2 - \lambda & \mathrm{Cov} \\ \mathrm{Cov} & s_2^2 - \lambda \end{vmatrix} = 0 \tag{22}$$

where Cov is the covariance between X1 and X2. Expanding Equation (22) gives
the quadratic equation,

$$(s_1^2 - \lambda)(s_2^2 - \lambda) - \mathrm{Cov}^2 = 0 \tag{23}$$

and substituting the values for our Cr and Ni data,

$$(3.07 - \lambda)(2.43 - \lambda) - 2.47^2 = 0 \tag{24}$$

which simplifies to

$$\lambda^2 - 5.5\lambda + 1.36 = 0 \tag{25}$$

This is a simple quadratic equation providing two characteristic roots or
eigenvalues, viz. $\lambda_1 = 5.24$ and $\lambda_2 = 0.26$.

As a check in our calculations, the sum of the eigenvalues should be equal to
the sum of diagonal elements, the trace, of the original matrix (i.e.
3.07 + 2.43 = 5.24 + 0.26).
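The characteristic equation for a 2 × 2 symmetric matrix can be solved directly with the quadratic formula. The following sketch does this for the Cr-Ni values used above and repeats the trace check; the function name `eigenvalues_2x2_symmetric` is an illustrative choice.

```python
import math

def eigenvalues_2x2_symmetric(s1_sq, s2_sq, cov12):
    """Roots of (s1_sq - lam)(s2_sq - lam) - cov12**2 = 0, largest first."""
    trace = s1_sq + s2_sq
    det = s1_sq * s2_sq - cov12 ** 2
    disc = math.sqrt(trace ** 2 - 4 * det)
    return (trace + disc) / 2, (trace - disc) / 2

lam1, lam2 = eigenvalues_2x2_symmetric(3.07, 2.43, 2.47)
print(round(lam1, 2), round(lam2, 2))  # 5.24 0.26
print(round(lam1 + lam2, 2))           # 5.5, the trace of the matrix
```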

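Associated with each eigenvalue (the length of the new axis in our geometric
model) is a characteristic vector, the eigenvector, $\mathbf{v} = [v_1, v_2]$, defining the slope
of the axis. Our eigenvalues, $\lambda$, were defined as arising from a set of simultaneous
equations, Equation (19), which can now be expressed, for a 2 × 2
matrix, as,

$$\begin{pmatrix} A_{11} - \lambda & A_{12} \\ A_{21} & A_{22} - \lambda \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{26}$$

and the elements of $\mathbf{x}$ are the elements of the eigenvector associated with the first eigenvalue,
$\lambda_1$. For our 2 × 2, Ni-Cr variance-covariance data, substitution into (26) leads
to

$$\begin{pmatrix} s_1^2 - \lambda_1 & \mathrm{Cov} \\ \mathrm{Cov} & s_2^2 - \lambda_1 \end{pmatrix} \begin{pmatrix} v_{11} \\ v_{12} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{27}$$

$$\begin{pmatrix} s_1^2 - \lambda_2 & \mathrm{Cov} \\ \mathrm{Cov} & s_2^2 - \lambda_2 \end{pmatrix} \begin{pmatrix} v_{21} \\ v_{22} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \end{pmatrix} \tag{28}$$

with $v_{11}$ and $v_{12}$ as the elements of the eigenvector associated with the first eigenvalue, and $v_{21}$
and $v_{22}$ defining the slope of the second eigenvector.
Solving these equations gives

$$\mathbf{v}_1 = [0.751, 0.660] \tag{29}$$

which defines the slope of the major axis of the ellipse (Figure 12), and

$$\mathbf{v}_2 = [-0.660, 0.751] \tag{30}$$

which is perpendicular to $\mathbf{v}_1$ and is the slope of the ellipse's minor axis.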
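For the bivariate case the eigenvectors follow from the first row of the homogeneous equations once each eigenvalue is known. The sketch below solves that row and normalizes the result to unit length, using the Cr-Ni values from the text. Because the sign of an eigenvector is arbitrary, the code chooses the sign that makes the second element positive, which happens to match the values quoted above; the function name `unit_eigenvector` is illustrative.

```python
import math

def unit_eigenvector(s1_sq, cov12, lam):
    """Solve (s1_sq - lam) * v1 + cov12 * v2 = 0 and scale to unit length."""
    v1, v2 = cov12, lam - s1_sq       # satisfies the first row of (A - lam*I) v = 0
    norm = math.hypot(v1, v2)
    v1, v2 = v1 / norm, v2 / norm
    if v2 < 0:                        # the overall sign is arbitrary
        v1, v2 = -v1, -v2
    return v1, v2

S1_SQ, S2_SQ, COV12 = 3.07, 2.43, 2.47
trace = S1_SQ + S2_SQ
det = S1_SQ * S2_SQ - COV12 ** 2
root = math.sqrt(trace ** 2 - 4 * det)
lam1, lam2 = (trace + root) / 2, (trace - root) / 2

v_first = unit_eigenvector(S1_SQ, COV12, lam1)
v_second = unit_eigenvector(S1_SQ, COV12, lam2)
print([round(x, 3) for x in v_first], [round(x, 3) for x in v_second])
```

The dot product of the two vectors is zero to within rounding, confirming that the new axes are orthogonal.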

Having determined the orthogonal axes or principal components of our
bivariate data, it remains to undertake the projections of the data points on to
the new axes. For the first principal component, PC1,

$$PC1_i = 0.751\,X1_i + 0.660\,X2_i \tag{31}$$

and for the second principal component, PC2,

$$PC2_i = -0.660\,X1_i + 0.751\,X2_i \tag{32}$$

Thus, the elements of the eigenvectors become the required coefficients for
the original variables, and are referred to as loadings. The individual elements
of the new variables (PC1 and PC2) are derived from X1 and X2 and are termed
the scores.¹⁰,¹¹ The principal component scores for the chromium and nickel
data are given in Table 5.

Table 5 The PC scores for chromium and nickel concentrations

Sample      PC1      PC2
AO          2.91    -0.17
MPA         2.02     0.48
RSCV        4.82     0.94
TV          4.03     0.04
MV          3.89     1.24
PV         11.25     1.28
AV          7.71     0.14
RA          3.50     0.65
LAA         2.37     1.64
RV          2.95     0.33
LV          3.12     0.98
LV-PM       3.06     0.76
IVS         2.67     1.37
CR          3.04     1.18
SN          3.19     1.05
AVN + B     2.62     1.02
LBB         2.97     1.11

10 B.F.J. Manly, 'Multivariate Statistical Methods: A Primer', Chapman and Hall, London, UK,
1986.
11 R.E. Aries, D.P. Lidiard, and R.A. Spragg, Chem. Br., 1991, 27, 821.

The total variance of the original nickel and chromium data is
3.07 + 2.43 = 5.5 with X1 contributing 56% of the variance and X2 contributing
the remaining 44%. The calculated eigenvalues are the lengths of the two
principal axes and represent the variance associated with each new variable,
PC1 and PC2. The first principal component, therefore, contains 5.24/5.50 or
more than 95% of the total variance, and the second principal component less
than 5%, 0.26/5.50. If it were necessary to reduce the display of our original
bivariate data to a single dimension using only one variable, say chromium
concentration, then a loss of 44% of the total variance would ensue. Using the
first principal component, however, and optimally combining the two variables,
only 5% of the total variance would be missing.
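The percentages quoted above follow directly from the eigenvalues and the original variances. A minimal sketch of the arithmetic, using only values given in the text (the variable names are illustrative):

```python
eigenvalues = [5.24, 0.26]   # variances of PC1 and PC2
variances = [3.07, 2.43]     # variances of X1 (Cr) and X2 (Ni)

total = sum(variances)

# Percentage of the total variance carried by each new and each original variable.
pc_share = [100 * lam / total for lam in eigenvalues]
variable_share = [100 * v / total for v in variances]

print([round(s, 1) for s in pc_share])        # [95.3, 4.7]
print([round(s, 1) for s in variable_share])  # [55.8, 44.2]
```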

We are now in a position to return to the complete set of trace metal data in
Table 3 and apply principal components analysis to the full data matrix. The 
techniques described and used in the above example to extract and determine 
the eigenvalues and eigenvectors for two variables can be extended to the more 
general, multivariate case but the procedure becomes increasingly difficult and 
arithmetically tedious with large matrices. Instead, the eigenvalues are usually 
found by matrix manipulation and iterative approximation methods using 
appropriate computer software. Before such an analysis is undertaken, the 
question of whether to transform the original data should be considered. 
Examination of Table 3 indicates that the variates considered have widely 
differing means and standard deviations. Rather than standardizing the data, 
since they are all recorded in the same units, one other useful transformation is 
to take logarithms of the values. The result of this transformation is to scale all 
the data to a more similar range and reduce the relative effects of the more 
concentrated metals. Having performed the log-transformation on our data, 
the results of performing PCA on all 13 variables for the 17 samples are as given in
Table 6. 
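The effect of the logarithmic transformation on the relative spread of the variables can be seen with a small made-up example. The concentrations below are hypothetical and are not taken from Table 3; they simply represent two metals whose levels differ 100-fold. The helper names `log_transform` and `spread_ratio` are likewise illustrative.

```python
import math
import statistics

def log_transform(columns):
    """Take log10 of every (strictly positive) concentration value."""
    return [[math.log10(value) for value in column] for column in columns]

def spread_ratio(columns):
    """Ratio of the largest to the smallest column standard deviation."""
    sds = [statistics.stdev(column) for column in columns]
    return max(sds) / min(sds)

# Hypothetical concentrations for two metals differing 100-fold in level.
raw = [[100.0, 200.0, 400.0],
       [1.0, 2.0, 4.0]]

logged = log_transform(raw)
print(round(spread_ratio(raw), 1), round(spread_ratio(logged), 1))  # 100.0 1.0
```

Before transformation the more concentrated metal would dominate a covariance-based analysis; after taking logarithms both variables contribute on a comparable scale.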

According to the eigenvalue results presented in Table 6(b), and displayed in
the scree plot of Figure 13, over 84% of the total variance in the original data
can be accounted for by the first two principal components. The transformation
of the 13 original variables to two new linear combinations represents a considerable
reduction of the data presented whilst retaining much of the original
information. A scatter plot of the first two principal component scores is
shown in Figure 14, and patterns in the samples according to the distribution of
the trace metals in the data are evident. Three tissues, the pulmonary valve,
aortic valve, and the right superior vena cava, constitute unique groups of one
tissue each, well distinguished from the rest. The aorta, main pulmonary artery,
mitral, and tricuspid valves constitute a cluster of four tissue types. Finally,
there is a group of ten tissues derived from the myocardium. A more detailed
analysis and discussion of this data set is presented by Niedermeier.⁶

Figure 13 An eigenvalue scree plot for the heart-tissue trace metal data

Figure 14 Scatter plot of the 17 heart-tissue samples on the standardized first two principal
components from the trace metal data

As well as being used with discrete analytical data, such as the trace metal
concentrations discussed above, principal components analysis has been extensively
employed on digitized spectral profiles.¹² A simple example will illustrate
the basis of these applications. Infrared spectra of 21 samples of acrylic, PVC,
styrene, and nylon polymers, as thin films, were recorded in the range
4000-600 cm⁻¹. Each spectrum was normalized on the most intense absorption
band to remove film thickness effects, and reduced to 216 discrete values by
signal averaging. The resulting 21 × 216 data matrix was subjected to principal
components analysis. The resulting eigenvalues are illustrated in the scree plot
of Figure 15, and the first three principal components account for more than
91% of the total variance in the original spectra. A scatter plot of the polymer
data projected on to these three components is shown in Figure 16. It is evident
from this plot that these three components are sufficient to provide effective
clustering of the samples with clear separation between the groups and types of
polymer. The first component, PC1, forms an axis which would allow the
partitioning between acrylic polymer and other samples. PC2 provides for two
partitions; both the nylon and PVC polymers are separated from the styrene
and acrylic polymers. PC3 allows the separation between styrenes and others.
Examination of the principal component loadings, the eigenvectors, as functions
of wavelength, i.e. spectra of loadings, highlights the weights given to each
spectral point in each of the original spectra, Figure 17. It can be seen from
these 'spectra' that where a partition between sample types is formed, the
majority of absorption bands in the corresponding spectra receive strong
positive or negative weighting. PC2 produces two partitions: the bands in nylon
spectra, Figure 17(b), receive positive weightings and the bands in PVC
spectra, Figure 17(c), have negative weightings.

Figure 15 A scree plot for the eigenvalues derived from the IR spectra of 21 polymers

Figure 16 A three-dimensional scatter plot of the polymer spectra projected on to the first
three principal components (A, nylon; B, styrene; C, PVC; D, acrylic)

12 I.A. Cowe and J.W. McNicol, Appl. Spectrosc., 1985, 39, 257.

The power of principal components analysis is in providing a mathematical 
transformation of our analytical data to a form with reduced dimensionality. 
From the results, the similarity and difference between objects and samples can 
often be better assessed and this makes the technique of prime importance in 
chemometrics. Having introduced the methodology and basics here, future 
chapters will consider the use of the technique as a data preprocessing tool. 


Factor Analysis 


The extraction of the eigenvectors from a symmetric data matrix forms the basis 
and starting point of many multivariate chemometric procedures. The way in 
which the data are preprocessed and scaled, and how the resulting vectors are 
treated, has produced a wide range of related and similar techniques. By far the 
most common is principal components analysis. As we have seen, PCA provides
n eigenvectors derived from an n × n dispersion matrix of variances and
covariances, or correlations. If the data are standardized prior to eigenvector
analysis, then the variance-covariance matrix becomes the correlation matrix
[see Equation (25) in Chapter 1, with $s_1 = s_2$]. Another technique, strongly
related to PCA, is factor analysis.¹³

Factor analysis is the name given to eigen analysis of a data matrix with the 
intended aim of reducing the data set of n variables to a specified number, p, of 
fewer linear combination variables, or factors, with which to describe the data. 
Thus, p is selected to be less than n and, hopefully, the new data matrix will be 
more amenable to interpretation. The final interpretation of the meaning and 
significance of these new factors lies with the user and the context of the 
problem. 

A full description and derivation of the many factor analysis methods 
reported in the analytical literature is beyond the scope of this book. We will 
limit ourselves here to the general and underlying features associated with the 
technique. A more detailed account is provided by, for example, Hopke¹⁴,¹⁵ and
others.¹⁶⁻¹⁹

13 D. Child, 'The Essentials of Factor Analysis', 2nd Edn, Cassel Educational, London, UK, 1990.
14 P.K. Hopke, in 'Methods of Environmental Data Analysis', ed. C.N. Hewitt, Elsevier, Essex,
UK, 1992.
15 P.K. Hopke, Chemomet. Intell. Lab. Systems, 1989, 6, 7.
16 E. Malinowski, 'Factor Analysis in Chemistry', J. Wiley and Sons, New York, USA, 1991.
17 T.P.E. Auf der Heyde, J. Chem. Ed., 1983, 7, 149.
18 G.L. Ritter, S.R. Lowry, T.L. Isenhour, and C.L. Wilkins, Anal. Chem., 1976, 48, 591.
19 E. Malinowski and M. McCue, Anal. Chem., 1977, 49, 284.

Figure 17 Eigenvectors of the polymer data displayed as a function of wavelength and
compared with typical spectra: (a) the first principal component and a spectrum
of an acrylic sample; (b) the second PC and a nylon spectrum; (c) the second
PC and a PVC spectrum; and (d) the third PC and a polystyrene spectrum

The principal steps in performing a factor analysis are,


(a) preprocessing of the raw data matrix, 

(b) computing the symmetric matrix of covariances or correlations, i.e. the 
dispersion matrix, 

(c) extracting the eigenvalues and eigenvectors, 

(d) selecting the appropriate number of factors with which to describe the 
data, and 

(e) rotating these factors to provide a meaningful interpretation of the 
factors. 


Steps (a) to (c) are as for principal components analysis. However, as the final 
aim is usually to interpret the results of the analysis in terms of chemical or 
spectroscopic properties, the method adopted at each step should be selected 
with care and forethought. A simple example will serve to illustrate the prin- 
ciples of factor analysis and the application of some of the options available at 
each stage. 

Table 7 provides the digitized mass spectra of five cyclohexane/hexane
mixtures, each recorded at 17 m/z values and normalized to the most intense,
parent ion.²⁰ These spectra are illustrated in Figure 18. Presented with these
data, and in a 'real' situation not knowing the composition of the mixtures,
our first task is to determine how many discrete components contribute to these
spectra, i.e. how many components are in the mixtures. We can then attempt to
identify the nature or source of each extracted component. These are the aims
of factor analysis.

Table 7 Normalized MS data for cyclohexane and hexane mixtures

                  A        B        C        D        E
% Cyclohexane    90       80       50       20       10

m/z
27            13.79    19.05    20.80    28.30    24.55
29            12.93    15.87    26.40    44.04    38.18
39            17.24    17.46    20.00    20.02    18.18
40             4.31     4.76     4.00     3.00     3.64
41            55.17    63.49    74.40    91.09    80.00
42            29.31    29.37    36.80    47.05    41.82
43            21.55    26.19    49.60    74.07    71.82
44             1.72     1.59     1.60     3.00     1.82
54             5.17     4.76     4.00     3.00     1.82
55            31.90    34.13    28.00    24.02    10.00
56           100.00   100.00   100.00   100.00    73.36
57            19.83    25.40    58.40    96.10   100.00
69            29.31    26.98    21.60    16.02     9.09
83             5.17     4.76     4.00     4.00     3.64
84            70.69    68.25    58.40    36.04    18.18
85             6.90     6.35     4.80     5.00     2.73
86             3.45     4.76    12.80    24.02    22.73

Mean          25.20    26.66    30.92    36.33    30.86

Figure 18 The mass spectra recorded from five mixtures of cyclohexane and hexane

20 R.W. Rozett and E.M. Petersen, Anal. Chem., 1975, 47, 1301.

Before we can compute the eigenvectors associated with our data matrix, we 
need to select appropriate, if any, preprocessing methods for the data, and the 
form of the dispersion matrix. Specifically, we can choose to generate a 
covariance matrix or a correlation matrix from the data. Each of these could be 
derived from the original, origin-centred data or from transformed, mean- 
centred data. In addition, we should bear in mind the aim of the analysis and 
decide whether the variables for numerical analysis are the m/z values or the 
composition of the sample mixtures themselves. Thus we have eight options in 
forming the transformed, symmetric matrix for extracting eigenvectors. We can 
form a 5 × 5 covariance, or correlation, matrix on the origin- or mean-centred
compositional values. Alternatively, a 17 × 17 covariance, or correlation,
matrix can be formed from origin- or mean-centred m/z values.

Each of these transformations can be expressed in matrix form as a transform
of the data matrix X to a new matrix Y followed by calculating the appropriate
dispersion matrix, C (the variance-covariance, or correlation matrix). The
relevant equations are 


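$$\mathbf{Y} = \mathbf{X}\mathbf{A} + \mathbf{B} \tag{33}$$

and

$$\mathbf{C} = \mathbf{Y}^{\mathrm{T}}\mathbf{Y}/(n - 1) \tag{34}$$

The nature of C depends on the definition of A and B. A is a scaling diagonal
matrix; only the diagonal elements need be defined. B is a centring matrix in
which all elements in any one column are identical.

For covariance about the origin,

$$a_{jj} = 1 \quad \text{and} \quad b_{ij} = 0 \tag{35}$$

For covariance about the mean,

$$a_{jj} = 1 \quad \text{and} \quad b_{ij} = -\bar{x}_j \tag{36}$$

For correlation about the origin,

$$a_{jj} = \left( \frac{1}{n} \sum_{i=1}^{n} x_{ij}^2 \right)^{-1/2} \quad \text{and} \quad b_{ij} = 0 \tag{37}$$

For correlation about the mean,

$$a_{jj} = \left( \frac{1}{n} \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \right)^{-1/2} \quad \text{and} \quad b_{ij} = -\bar{x}_j a_{jj} \tag{38}$$

where $\bar{x}_j$ is the mean value of column j from the data matrix.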
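The four dispersion matrices described by Equations (33) to (38) differ only in whether the columns are centred and whether they are scaled. The sketch below is one possible reading of those equations in plain Python: `centre` switches the column means on or off and `scale` switches the root-mean-square scaling on or off. The function name, the flag names, and the small data matrix are all hypothetical and serve only to illustrate the idea.

```python
import math

def dispersion_matrix(x, centre=False, scale=False):
    """Return C = Y^T Y / (n - 1), where Y is X after optional centring and scaling.

    centre=False, scale=False : covariance about the origin
    centre=True,  scale=False : covariance about the mean
    centre=False, scale=True  : correlation about the origin
    centre=True,  scale=True  : correlation about the mean
    """
    n, m = len(x), len(x[0])
    means = [sum(row[j] for row in x) / n for j in range(m)]
    shift = means if centre else [0.0] * m
    a = [1.0] * m
    if scale:
        # Diagonal scaling element: inverse root-mean-square of the (shifted) column.
        a = [1.0 / math.sqrt(sum((row[j] - shift[j]) ** 2 for row in x) / n)
             for j in range(m)]
    y = [[(row[j] - shift[j]) * a[j] for j in range(m)] for row in x]
    return [[sum(r[j] * r[k] for r in y) / (n - 1) for k in range(m)]
            for j in range(m)]

# Hypothetical data matrix: three objects (rows) and two variables (columns).
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 9.0]]

cov_origin = dispersion_matrix(data)
cov_mean = dispersion_matrix(data, centre=True)
corr_mean = dispersion_matrix(data, centre=True, scale=True)
```

For a Q-mode analysis the same function would be applied to the transposed data matrix, so that the samples rather than the measured variables become the columns.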

Mean-centring is a common pre-processing transformation as it provides 
data which are symmetric about a zero mean. It is recommended as a pre- 
processing step in many applications. This is not necessary here with our mass 
spectra, however, as the intensity scale has a meaningful zero. 

As the analytical data are all in the same units and cover a similar range of 
magnitude, standardization is not required either and the variance-covariance 
matrix will be used as the dispersion matrix. 

The final decision to be made is whether to operate on the m/z values or
the samples (actually the mixture compositions) as the analytical variables. It is 
a stated aim of our factor analysis to determine some physical meaning of the 
derived factors. We do not wish simply to perform a mathematical trans- 
formation to reduce the dimensionality of the data, as would be the case with 
principal components analysis. 

We will proceed, therefore, with an eigenvector analysis of the 5 × 5
covariance matrix obtained from zero-centred object data. This is referred to as
Q-mode factor analysis and is complementary to the scheme illustrated previously
with principal components analysis. In the earlier example the dispersion
matrix was formed between the measured trace metal variables, and the
technique is sometimes referred to as R-mode analysis. For the current MS
data, processing by R-mode analysis would involve the data being scaled along
each m/z column and information about relative peak sizes in any single
spectrum would be destroyed. In Q-mode analysis, any scaling is performed
within a spectrum and the mass fragmentation pattern for each sample is
preserved.

Table 8 The variance-covariance matrix for the MS data and the eigenvalues
and eigenvectors extracted from this

Covariance matrix
            A         B         C         D         E
A        726.62
B        726.04    734.27
C        683.14    713.48    808.63
D        594.23    655.34    890.97   1157.49
E        421.65    485.71    754.65   1065.58   1025.79

Eigenvalues
Factor                        1         2         3         4         5
Eigenvalue                 3744.08    703.56     2.84      1.25      1.07
Cumulative % contribution    84.08     99.88    99.95     99.98    100.00

Eigenvectors
          F(I)     F(II)    F(III)    F(IV)     F(V)
A        0.368     0.557    -0.424    0.607    -0.075
B        0.389     0.486     0.401   -0.462    -0.488
C        0.461     0.121    -0.193   -0.434     0.740
D        0.533    -0.359     0.605    0.446     0.146
E        0.464    -0.557    -0.505   -0.176    -0.434

The variance-covariance matrix of the mass spectra data is presented in 
Table 8, along with the results of computing the eigenvectors and eigenvalues 
from this matrix. In factor analysis we assume that any relationships between 
our samples from within the original data set can be represented by p mutually 
uncorrelated underlying factors. The value of p is usually selected to be much 
less than the number of original variables. These p underlying factors, or new 
variables, are referred to as common factors and may be amenable to physical 
interpretation. The remaining variance not accounted for by the p factors will 
be contained in a unique factor and may be attributable, for example, to noise in 
the system. 

Figure 19 The scree plot of the eigenvalues extracted from the MS data

The first requirement is to determine the appropriate value of p, the number
of factors necessary to describe the original data adequately. If p cannot be
specified then the partition of total variance between common and unique
factors cannot be determined. For our simple example with the mass spectra
data it appears obvious that p = 2, i.e. there are two common factors which we
may interpret as being due to two components in the mixtures. The eigenvalues
drop markedly from the second to the third value, as can be seen from Table 8
and Figure 19. The first two factors account for more than 99% of the total
variance. The choice is not always so clear, however, and in the chemometrics
literature a number of more objective functions have been described to select
appropriate values of p.¹⁴
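The cumulative contributions in Table 8 and the position of the break in the scree plot can be reproduced with a few lines of Python. Picking p at the largest ratio between successive eigenvalues is only one simple heuristic, used here as an assumption for illustration; it is not one of the objective functions referred to in the literature cited above.

```python
eigenvalues = [3744.08, 703.56, 2.84, 1.25, 1.07]  # Table 8

total = sum(eigenvalues)
cumulative = []
running = 0.0
for value in eigenvalues:
    running += value
    cumulative.append(round(100 * running / total, 2))

# Ratio of each eigenvalue to the next; the largest ratio marks the sharpest drop.
drops = [eigenvalues[i] / eigenvalues[i + 1] for i in range(len(eigenvalues) - 1)]
p = drops.index(max(drops)) + 1

print(cumulative)  # [84.08, 99.88, 99.95, 99.98, 100.0]
print(p)           # 2
```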

The eigenvectors in Table 8 have been normalized, i.e. each vector has unit
length, viz.,

$$(0.368)^2 + (0.389)^2 + (0.461)^2 + (0.533)^2 + (0.464)^2 = 1 \tag{39}$$

To perform factor analysis, the eigenvectors should be converted so that each
vector length represents the magnitude of the eigenvalue. This conversion is
achieved by multiplying each element in the normalized eigenvector matrix by
the square root of the corresponding eigenvalue. From Table 8, the variance
associated with the first factor is its eigenvalue, 3744.08, and the first eigenvector
is converted to the first factor by multiplying by $\sqrt{3744.08}$, viz.

$$\text{Factor 1} = \begin{pmatrix} 0.368\sqrt{3744.08} \\ 0.389\sqrt{3744.08} \\ 0.461\sqrt{3744.08} \\ 0.533\sqrt{3744.08} \\ 0.464\sqrt{3744.08} \end{pmatrix} = \begin{pmatrix} 22.52 \\ 23.82 \\ 28.24 \\ 32.64 \\ 28.40 \end{pmatrix} \tag{40}$$

The elements in each of the factors are the factor loadings, and the complete
factor loading matrix for our MS data is given in Table 9. This conversion has
not changed the orientation of the factor axes from the original eigenvectors
but has simply changed their absolute magnitude. The lengths of each vector
are now equal to the square root of the eigenvalues, i.e. the factors represent the
standard deviations.

Table 9 The factor loading matrix from a Q-mode analysis of the MS data

          F(I)     F(II)    F(III)    F(IV)     F(V)
A        22.52     14.78    -0.71     0.68     -0.08
B        23.82     12.88     0.68    -0.52     -0.51
C        28.24      3.20    -0.32    -0.48      0.77
D        32.64     -9.51     1.02     0.50      0.15
E        28.40    -14.78    -0.85    -0.20     -0.45
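The first factor can be recovered from the covariance matrix of Table 8 with a short calculation. The sketch below uses power iteration, chosen here only because it needs no external library; it is not necessarily the algorithm used to produce the tables. It finds the largest eigenvalue and its unit-length eigenvector, then scales the vector by the square root of the eigenvalue to obtain the loadings. The function name `dominant_eigenpair` is illustrative.

```python
import math

# Variance-covariance matrix of samples A-E from Table 8 (full symmetric form).
COV = [
    [726.62, 726.04, 683.14, 594.23, 421.65],
    [726.04, 734.27, 713.48, 655.34, 485.71],
    [683.14, 713.48, 808.63, 890.97, 754.65],
    [594.23, 655.34, 890.97, 1157.49, 1065.58],
    [421.65, 485.71, 754.65, 1065.58, 1025.79],
]

def dominant_eigenpair(matrix, iterations=200):
    """Power iteration: largest eigenvalue and its unit-length eigenvector."""
    n = len(matrix)
    vector = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(p * p for p in product))
        vector = [p / norm for p in product]
    product = [sum(matrix[i][j] * vector[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v * p for v, p in zip(vector, product))  # Rayleigh quotient
    return eigenvalue, vector

eigenvalue, eigenvector = dominant_eigenpair(COV)

# Factor loadings: eigenvector elements scaled by the square root of the eigenvalue.
loadings = [v * math.sqrt(eigenvalue) for v in eigenvector]
print(round(eigenvalue, 1), [round(value, 1) for value in loadings])
```

Small differences from the tabulated values in the second decimal place are to be expected, since the covariance matrix is itself quoted to two decimal places.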

From Table 8, the first factor accounts for 3744.08/4452.80 = 84% of the
variance in the data. Of this, $22.52^2/3744.08 = 13.5\%$ is derived from object or
sample A, 15.2% from B, 21.3% from C, 28.5% from D, and 21.5% from E.
The total variance associated with object A is accounted for by the five factors.
Taking the square of each element in the factor matrix (remember, these are
standard deviations) and summing for each object provides the amount of
variance contributed by each object.

For sample A,

$$22.52^2 + 14.78^2 + (-0.71)^2 + 0.68^2 + (-0.08)^2 = 726.6 \tag{41}$$

and for the other samples,

$$\begin{aligned}
\text{B:}\quad & 23.82^2 + 12.88^2 + 0.68^2 + 0.52^2 + 0.51^2 = 734.3 \\
\text{C:}\quad & 28.24^2 + 3.20^2 + 0.32^2 + 0.48^2 + 0.77^2 = 808.6 \\
\text{D:}\quad & 32.64^2 + 9.51^2 + 1.02^2 + 0.50^2 + 0.15^2 = 1157.5 \\
\text{E:}\quad & 28.40^2 + 14.78^2 + 0.85^2 + 0.20^2 + 0.45^2 = 1025.8
\end{aligned} \tag{42}$$


These values are identical to the diagonal elements of the variance- 
covariance matrix from the original data, Table 8, and represent the variance of 
each object. With all five factors, the total variance is accounted for. If we use 
fewer factors to explain the data, and this is after all the point of performing a 
factor analysis, then these totals will be less than 100%. Using just the first two
factors, for example, then

$$\begin{aligned}
\text{A:}\quad & 22.52^2 + 14.78^2 = 725.6, & 725.6/726.6 &= 0.999 \\
\text{B:}\quad & 23.82^2 + 12.88^2 = 733.3, & 733.3/734.3 &= 0.999 \\
\text{C:}\quad & 28.24^2 + 3.20^2 = 807.7, & 807.7/808.6 &= 0.999 \\
\text{D:}\quad & 32.64^2 + 9.51^2 = 1155.8, & 1155.8/1157.5 &= 0.998 \\
\text{E:}\quad & 28.40^2 + 14.78^2 = 1025.0, & 1025.0/1025.8 &= 0.999
\end{aligned} \tag{43}$$

The final values listed in Equation 43 represent the fraction of each object's
variance explained by the two factors. They are referred to as the communality
values, denoted $h^2$, and they depend on the number of factors used. As the
number of factors retained increases, then the communalities tend to unity. The
remaining $(1 - h^2)$ fraction of the variance for each sample is considered as
being associated with its unique variance and is attributable to noise.
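The communality calculation can be expressed compactly in code. The sketch below takes the loadings from Table 9 and returns, for each sample, the fraction of its variance reproduced by the first few factors. Because the tabulated loadings are rounded to two decimal places, the results may differ from the values quoted above in the last digit; the function name `communality` is an illustrative choice.

```python
# Factor loadings from Table 9 (rows: samples A-E; columns: factors I-V).
LOADINGS = {
    "A": [22.52, 14.78, -0.71, 0.68, -0.08],
    "B": [23.82, 12.88, 0.68, -0.52, -0.51],
    "C": [28.24, 3.20, -0.32, -0.48, 0.77],
    "D": [32.64, -9.51, 1.02, 0.50, 0.15],
    "E": [28.40, -14.78, -0.85, -0.20, -0.45],
}

def communality(loadings, n_factors):
    """Fraction of an object's variance reproduced by the first n_factors."""
    total = sum(value ** 2 for value in loadings)
    retained = sum(value ** 2 for value in loadings[:n_factors])
    return retained / total

variance_a = sum(value ** 2 for value in LOADINGS["A"])
h2 = {name: communality(row, 2) for name, row in LOADINGS.items()}

print(round(variance_a, 1))  # 726.6
print({name: round(value, 3) for name, value in h2.items()})
```

Calling `communality(row, 5)` returns exactly 1 for every sample, reflecting the statement that all five factors together account for the total variance.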

Returning to our mass spectra, having calculated the eigenvalues, eigen- 
vectors, and factor loadings, we must decide how many factors need be retained 
in our model. In the absence of noise in the measurements, the eigenvalues 
above the number necessary to describe the data are zero. In practice, of course, 
noise will always be present. However, as we can see with our mass spectra data 
a large relative decrease in the magnitude of the eigenvalues occurs after two 
values, so we can assume here that p = 2. Hopke¹⁴ provides an account of
several objective functions to assess the correct number of factors. 

Having reduced the dimensionality of the data by electing to retain two 
factors, we can proceed with our analysis and attempt to interpret them. 
Examination of the columns of loadings for the first two factors in the factor 
matrix, Table 9, shows that some values are negative. The physical significance 
of these loadings or coefficients is not immediately apparent. The loadings for 
these two factors are illustrated graphically in Figure 20(a). The location of the 
orthogonal vectors in the two-factor space has been constrained by the three 
remaining but unused factors. If these three factors are not to be used then we 
can rotate the first two factors in the sample space and possibly find a better 
position for them; a position which will provide for a more meaningful inter- 
pretation of the data. Of the several factor rotation schemes routinely used in 
factor analysis, that referred to as the varimax technique is most commonly 
used. Varimax rotation moves each factor axis to a new, but still orthogonal, 
position so that the loading projections are near either the origin or the 
extremities of these new axes. The rotation is rigid, to retain orthogonality
between factors, and is undertaken using an iterative algorithm. 
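
The text does not say which iterative algorithm is used. As a minimal sketch, the block below assumes the SVD-based iteration that is often used for varimax; the function names `varimax` and `varimax_criterion`, the convergence tolerance, and the small loadings matrix are all hypothetical choices made for illustration, not values from the mass spectra example.

```python
import numpy as np

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return float((np.asarray(loadings) ** 2).var(axis=0).sum())

def varimax(loadings, max_iter=100, tol=1e-8):
    """Rigid (orthogonal) rotation of a loadings matrix towards simple structure."""
    L = np.asarray(loadings, dtype=float)
    n_rows, n_factors = L.shape
    R = np.eye(n_factors)              # accumulated rotation matrix
    criterion = 0.0
    for _ in range(max_iter):
        rotated = L @ R
        # Gradient of the varimax criterion with respect to the rotation
        grad = L.T @ (rotated ** 3 - rotated * (rotated ** 2).sum(axis=0) / n_rows)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                     # orthogonal matrix closest to the gradient
        if s.sum() < criterion * (1 + tol):
            break                      # no further improvement
        criterion = s.sum()
    return L @ R, R

# Hypothetical loadings with a clear two-factor structure, deliberately
# rotated by 30 degrees so that the structure is hidden.
simple = np.array([[1.0, 0.0], [0.9, 0.1], [0.1, 0.9], [0.0, 1.0]])
theta = np.deg2rad(30.0)
rot = np.array([[np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
unrotated = simple @ rot

rotated, R = varimax(unrotated)
print(np.round(rotated, 2))
print("criterion before:", varimax_criterion(unrotated))
print("criterion after: ", varimax_criterion(rotated))
```

Because the rotation matrix is orthogonal, the sum of squared loadings for each sample is unchanged by the rotation; only the way it is shared between the two factors alters.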

Using the varimax rotation method, the rotated factor loadings for the first 
two factors from the mass spectra data are given in Table 10 and are plotted in 
Figure 20(b). The relative position of the objects to each other has remained 
unchanged, but all loadings are now positive. In fact, all loadings are present in 
the first quadrant of the diagram and in an order we can recognize as corres- 
ponding to the mixtures' compositional analysis. Sample A is predominantly 
cyclohexane (90%) and E hexane (90%). Examination of Figure 20(b) indicates 
how we could identify the nature of the two components if they were unknown, 
as would be the case with a real set of samples of course. Presumably, if the 
mass spectra of the two pure components were present in, or added to, the 
original data matrix, then the loadings associated with these samples would 
align themselves closely with the pure factors. Such, in fact, is the case. Table 11 
provides the normalized mass spectra of cyclohexane and hexane, and Table 12 
gives the varimax rotated factor loadings for the now 7 × 7 variance-covariance
matrix from the data now containing the mixtures and the pure
components. The loadings of the first two factors for the seven samples are
illustrated in Figure 21. As expected, the single-component spectra are closely
aligned with the axes of the derived and rotated factors.

Figure 20 The original factor loadings obtained from the MS data (a) and the rotated
factor loadings, following varimax rotation, with only two factors retained in
the model (b)

Figure 21 The rotated factor loadings for the MS data including the pure component
spectra, X1 (cyclohexane) and X2 (hexane)

Table 10 The rotated factor loading matrix, retaining only the first two factors
and using varimax

          F(I)     F(II)
A         7.16     25.97
B         9.39     25.40
C        19.10     21.04
D        30.80     14.40
E        31.09      7.64


Table 11 Normalized mass spectral data for cyclohexane and hexane

m/z    Cyclohexane    Hexane
27         13.3         22.9
28         15.5         16.4
29          9.6         41.8
39         18.5         13.1
40          5.2          3.3
41         52.6         72.1
42         25.9         38.5
43         16.3         70.5
44          1.5          2.5
54          5.9          0.8
55         34.1          7.4
56        100.0         55.7
57          8.9        100.0
69         28.1          1.6
83          5.9          0.8
84         79.3          0.8
85          6.7          0.8
86          0.7         22.9


Table 12 The rotated factor loadings for the complete MS data set, including the
two pure components, cyclohexane (X1) and hexane (X2)

          F(I)     F(II)
X1        3.23     28.05
A         6.45     26.15
B         8.71     25.63
C        18.51     21.57
D        30.69     15.36
E        30.85      8.50
X2       31.12      2.45

Varimax rotation is a commonly used and widely available factor rotation
technique, but other methods have been proposed for interpreting factors from
analytical chemistry data. We could rotate the axes in order that they align
directly with factors from expected components. These axes, referred to as test
vectors, would be physically significant in terms of interpretation and the
rotation procedure is referred to as target transformation. Target
transformation factor analysis has proved to be a valuable technique in
chemometrics. The number of components in mixture spectra can be identified and
the rotated factor loadings can be interpreted in terms of test data relating to
standard, known spectra.

In this chapter we have been able to discuss only some of the more common 
and basic methods of feature selection and extraction. This area is a major 
subject of active research in chemometrics. The effectiveness of subsequent data 
processing and interpretation is largely governed by how well our analytical 
data have been summarized by these methods. The interested reader is encour- 
aged to study the many specialist texts and journals available to appreciate the 
wide breadth of study associated with this subject. 


2 P.K. Hopke, D.J. Alpert, and B.A. Roscoe, Comput. Chem., 1983, 7, 149.
Pattern Recognition I: Unsupervised Analysis


1 Introduction 


It is an inherent human trait that, presented with a collection of objects, we will
attempt to classify them and organize them into groups according to some
observed or perceived similarity. Whether it is with childhood toys and sorting
blocks by shape or into coloured sets, or with hobbies devoted to collecting, we 
obtain satisfaction from classifying things. This characteristic is no less evident 
in science. Indeed, without the ability to classify and group objects, data, and 
ideas, the vast wealth of scientific information would be little more than a single 
random set of data and be of little practical value or use. There are simply too 
many objects or events encountered in our daily routine to be able to consider 
each as an individual and discrete entity. 

Instead, it is common to refer an observation or measure to some previously 
catalogued, similar example. The organization in the Periodic Table, for 
example, allows us to study group chemistry with deviations from general 
behaviour for any element to be recorded as required. In a similar manner, 
much organic chemistry can be catalogued in terms of the chemistry of generic 
functional groups. In infrared spectroscopy, the concept of correlation between 
spectra and molecular structures is exploited to provide the basis for spectral 
interpretation; in general each functional group exhibits well defined regions of 
absorption. 

Although the human brain is excellent at recognizing and classifying patterns 
and shapes, it performs less well if an object is represented by a numerical list of 
attributes, and much analytical data is acquired and presented in such a form. 
Consider the data shown in Table 1, obtained from an analysis of a series of 
alloys. This is only a relatively small data set but it may not be immediately 
apparent that these samples can be organized into well defined groups defining 
the type or class of alloy according to their composition. The data from Table 1 
are expressed diagrammatically in Figure 1. Although we may guess that there 
are two similar groups based on the Ni, Cr, and Mn content, the picture suffers 
from the presence of extraneous data. The situation would be more complex 
Table 1 Concentration, expressed as mg kg⁻¹, of trace metals in six alloy
samples

       Ni     Cr     Mn     V      Co
A      6.1    1.1    1.2    5.2    4.0
B      1.6    3.8    4.1    4.0    3.9
C      1.2    3.1    2.9    3.6    3.4
D      5.1    1.5    1.6    4.2    3.2
E      4.8    1.8    1.2    3.7    3.0
F      2.1    3.4    4.4    4.3    4.1


Figure 1 The trace metal data from Table 1 plotted to illustrate the presence of two
groups

still if more objects were analysed or more variables were measured. As modern 
analytical techniques are able to generate large quantities of qualitative and 
quantitative data, it is necessary to seek and apply formal methods which can 
serve to highlight similarities and differences between samples. The general 
problem is one of classification and the contents of this chapter are concerned 
with addressing the following, broadly stated task. Given a number of objects 
or samples, each described by a set of measured values, we are to derive a 
formal mathematical scheme for grouping the objects into classes such that 
objects within a class are similar, and different from those in other classes. The 
number of classes and the class characteristics are not known a priori but are to 
be determined from the analysis.¹

It is the last statement in the challenge facing us that distinguishes the 


1 B. Everitt, ‘Cluster Analysis’, 2nd Edn, Heinemann Educational, London, UK, 1980. 
Figure 2 What constitutes a cluster and its boundary will depend on interpretation as well
as the clustering algorithm employed (scatter plot of Variable 2 against Variable 1)


techniques studied here from supervised pattern recognition schemes to be 
examined in Chapter 5. In supervised pattern recognition, a training set is 
identified for which the parent class or group of each sample is known, and
this information is used to develop a suitable discriminant function with which 
new, unclassified samples can be examined and assigned to one of the parent 
classes. In the case of unsupervised pattern recognition, often referred to as 
cluster analysis or numerical taxonomy, no class knowledge is available and no 
assumptions need be made regarding the class to which a sample may belong. 
Cluster analysis is a powerful investigative tool, which can aid in determining 
and identifying underlying patterns and structure in apparently meaningless 
tables of data. Its use in analytical science is widespread and increasing. Some 
of the varied areas of its application are model fitting and hypothesis testing, 
data exploration, and data reduction.²,³

The general scheme, or algorithm, followed in order to perform unsupervised 
pattern recognition and undertake cluster analysis, proceeds in the following 
manner. The data set comprising the original, or suitably processed, analytical 
data characterizing our samples is first converted into some corresponding set 
of similarity, or dissimilarity, measures between each sample. The subsequent 
aim is to ensure that similar objects are clustered together with minimal 
separation between objects in a class or cluster, whilst maximizing the separa- 
tion between different clusters. It is the concept of similarity between objects 
that provides the richness and variety of the wide range of techniques available 
for cluster analysis. To appreciate this concept, it is worth considering what 
may constitute a cluster. In Figure 2, two variate measures on a set of 50 


2 J. Chapman, ‘Computers in Mass Spectrometry’, Academic Press, London, UK, 1978.
3 H.L.C. Meuzelaar and T.L. Isenhour, ‘Computer Enhanced Analytical Spectroscopy’, Plenum
Press, New York, USA, 1987.
samples are represented in a simple scatter plot. It is evident from visual
inspection that there are many ways of dividing the pattern space and pro- 
ducing clusters or groups of objects. There is no single 'correct' result, and the 
success of any clustering method depends largely on what is being sought, and 
the intended subsequent use of the information. Clusters may be defined 
intuitively and their structure and contents will depend on the nature of the 
problem. The presence of a cluster does not readily admit precise mathematical 
definition. 


2 Choice of Variable 


In essence, what all clustering algorithms aim to achieve is to group together 
similar, neighbouring points into clusters in the n-dimensional space defined by 
the n-variate measures on the objects. As with supervised pattern recognition 
(see Chapter 5), and other chemometric techniques, the selection of variables 
and their pre-processing can greatly influence the outcome. It is worth 
repeating that cluster analysis is an exploratory, investigative technique and a 
data set should be examined using several different methods in order to obtain a 
more complete picture of the information contained in the data. 

The initial choice of the measurements made and used to describe each object
constitutes the frame of reference within which the clusters are to be established.
This choice will reflect an analyst's judgement of relevance for the purpose of 
classification, based usually on prior experience. In most cases the number of 
variables is determined empirically and often tends to exceed the minimum 
required to achieve successful classification. Although this situation may guar- 
antee satisfactory classification, the use of an excessive number of variables can 
severely affect computation time and a method's efficiency. Applying some
preprocessing transformation to the data is often worthwhile. Standardization 
of the raw data can be undertaken, and is particularly valuable when different 
types of variable are measured. But it should be borne in mind that standardi- 
zation can have the effect of reducing or eliminating the very differences 
between objects which are required for classification. Another technique worth 
considering is to perform a principal components analysis on the original data, 
to produce a set of new, statistically independent variables. Cluster analysis can 
then be performed on the first few principal components describing the major- 
ity of the samples' variance. 
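
As an illustration of the two pre-processing steps just described, the sketch below standardizes a small data matrix and then computes principal component scores by singular value decomposition. The data values are hypothetical, and the choice of retaining components up to 95% of the variance is an assumption made for the example, not a recommendation from the text.

```python
import numpy as np

# Hypothetical data: 6 samples (rows) by 3 variables measured in different units.
X = np.array([
    [6.1, 110.0, 0.12],
    [1.6, 380.0, 0.41],
    [1.2, 310.0, 0.29],
    [5.1, 150.0, 0.16],
    [4.8, 180.0, 0.12],
    [2.1, 340.0, 0.44],
])

# Standardize each variable to zero mean and unit standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Principal components from the singular value decomposition of the scaled data.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s                        # samples expressed on the new, uncorrelated axes
explained = s ** 2 / np.sum(s ** 2)   # fraction of the variance per component

# Assumed rule: keep the leading components describing at least 95% of the variance.
n_keep = int(np.searchsorted(np.cumsum(explained), 0.95)) + 1
reduced = scores[:, :n_keep]          # input for a subsequent cluster analysis

print("variance fractions:", np.round(explained, 3))
print("components kept:", n_keep)
```

The columns of `scores` are mutually uncorrelated, so a distance computed on `reduced` is not dominated by groups of highly correlated original variables.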

Finally, having performed a cluster analysis, statistical tests can be employed 
to assess the contribution of each variable to the clustering process. Variables 
found to contribute little may be omitted and the cluster analysis repeated. 


3 Measures between Objects 


In general, clustering procedures begin with the calculation of a matrix of 
similarities or dissimilarities between the objects.¹ The output of the clustering
process, in terms of both the number of discrete clusters observed and the 
cluster membership, may depend on the similarity metric used. 
Similarity and distance between objects are complementary concepts for
which there is no single formal definition. In practice, distance as a measure of 
dissimilarity is a much more clearly defined quantity and is more extensively 
used in cluster analysis. 


Similarity Measures 


Similarity or association coefficients have long been associated with cluster 
analysis, and it is perhaps not surprising that the most commonly used is the 
correlation coefficient. Other similarity measures are rarely employed. Most are 
poorly defined and not amenable to mathematical analysis, and none have 
received much attention in the analytical literature. The calculation of corre- 
lation coefficients is described in Chapter One, and Table 2(a) provides the full 
symmetric matrix of these coefficients of similarity for the alloy data from Table 
1. With such a small data set, a cluster analysis can be performed manually to 
illustrate the stages involved in the process. The first step is to find the mutually 
largest correlation in the matrix to form centres for the clusters. The highest 
correlation in each column of Table 2(a) is shown in boldface type. Objects A 
and D form mutual highly correlated pairs, as do objects B and C. Note that 
although object E is most highly correlated with D, they are not considered as 
forming a pair as D most resembles A rather than E. Similarly, object F is not 
paired with B, as B is more similar to C. 


Table 2 The matrix of correlations between objects from Table 1, (a). Samples A
and D, B and C form new objects and a new correlation matrix can be
calculated, (b). Sample E then joins AD and F joins BC to provide the
final step and apparent correlation matrix, (c)

(a)       A       B       C       D       E       F
A         1     -0.62   -0.39    0.99    0.97   -0.44
B       -0.62     1      0.95   -0.68   -0.74    0.94
C       -0.39    0.95     1     -0.48   -0.51    0.89
D        0.99   -0.68   -0.48     1      0.98   -0.51
E        0.97   -0.74   -0.51    0.98     1     -0.61
F       -0.44    0.94    0.89   -0.51   -0.61     1

(b)       AD      BC      E       F
AD        1     -0.54    0.97   -0.47
BC      -0.54     1     -0.62    0.91
E        0.97   -0.62     1     -0.61
F       -0.47    0.91   -0.61     1

(c)       ADE     BCF
ADE       1     -0.56
BCF     -0.56     1

Figure 3 The stages of hierarchical clustering shown graphically as dendrograms. The
steps (a)-(c) correspond to the connections calculated in Table 2


The resemblance between the mutual pairs is indicated in the diagram shown 
in Figure 3, which links A to D and B to C by a horizontal line drawn from the
vertical axis at points representing their respective correlation coefficients.

At the next stage, objects A and D, and B and C, are considered to comprise
new, distinct objects with associative properties and are similar to the other 
objects according to their average individual values. Table 2(b) shows the newly 
calculated correlation matrix. Clusters AD and BC have a correlation coeffi- 
cient calculated from the sum of the correlations of A to B, D to B, A to C, and 
D to C, divided by four. The correlation between AD and E is the average of the 
original A to E and D to E correlations. The clustering procedure is now 
repeated, and object E joins cluster AD and object F joins BC, Figure 3(b). The 
process is continued until all clusters are joined and the final similarity matrix is 
produced as in Table 2(c) with the resultant diagram, a dendrogram, shown in 
Figure 3(c). That two groups, ADE and BCF, may be present in the original 
data is demonstrated. 
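
The manual procedure can be sketched in a few lines of code. The block below is an illustrative reconstruction, not the author's program: it takes the correlations of Table 2(a) (using the symmetric value 0.89 for the C to F entry), finds the mutually most similar pairs, and averages correlations between groups. The function names `mutual_pairs` and `merge_average` are hypothetical.

```python
import numpy as np

labels = ["A", "B", "C", "D", "E", "F"]
R = np.array([
    [ 1.00, -0.62, -0.39,  0.99,  0.97, -0.44],
    [-0.62,  1.00,  0.95, -0.68, -0.74,  0.94],
    [-0.39,  0.95,  1.00, -0.48, -0.51,  0.89],
    [ 0.99, -0.68, -0.48,  1.00,  0.98, -0.51],
    [ 0.97, -0.74, -0.51,  0.98,  1.00, -0.61],
    [-0.44,  0.94,  0.89, -0.51, -0.61,  1.00],
])

def mutual_pairs(R, labels):
    """Pairs of objects that are each other's most similar partner."""
    S = R.copy()
    np.fill_diagonal(S, -np.inf)       # ignore self-similarity
    best = S.argmax(axis=1)
    return [(labels[i], labels[j]) for i, j in enumerate(best)
            if best[j] == i and i < j]

def merge_average(R, labels, groups):
    """New similarity matrix: average of the similarities between group members."""
    idx = [[labels.index(m) for m in g] for g in groups]
    n = len(groups)
    out = np.ones((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                out[i, j] = R[np.ix_(idx[i], idx[j])].mean()
    return out

print(mutual_pairs(R, labels))         # the first pairs to be joined

# Second stage: AD and BC replace their members, E and F remain single objects.
Rb = merge_average(R, labels, [("A", "D"), ("B", "C"), ("E",), ("F",)])
print(np.round(Rb, 3))

# Final stage: E joins AD and F joins BC, averaging the second-stage values.
Rc = merge_average(Rb, ["AD", "BC", "E", "F"], [("AD", "E"), ("BC", "F")])
print(np.round(Rc, 3))
```

The printed values agree with the tabulated ones to within rounding of the second decimal place.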

From this extremely simple example, the basic steps involved in cluster 
analysis and the value of the technique in classification are evident. The final 
dendrogram, Figure 3(c), clearly illustrates the similarity between the different 
samples. The original raw tabulated data have been reduced to a pictorial form 
which simplifies and demonstrates the structure within the data. It is pertinent 
to ask, however, what information has been lost in producing the diagram and 
Table 3 The matrix of apparent correlations between the six alloy samples,
derived from the dendrogram of Figure 3 and Table 2(b) and (c)

          A       B       C       D       E       F
A         1     -0.56   -0.56    0.99    0.97   -0.56
B       -0.56     1      0.95   -0.56   -0.56    0.91
C       -0.56    0.95     1     -0.56   -0.56    0.91
D        0.99   -0.56   -0.56     1      0.97   -0.56
E        0.97   -0.56   -0.56    0.97     1     -0.56
F       -0.56    0.91    0.91   -0.56   -0.56     1

Figure 4 True vs. apparent correlations indicating the distortion introduced by averaging
correlation values to produce the dendrogram


to what extent does the graph accurately represent our original data. From the 
dendrogram and Table 2(b) the apparent correlation between sample B and 
sample F is 0.91, rather than the true value of 0.94 from the calculated similarity 
matrix. This error arose owing to the averaging process in treating the BC pair 
as a single entity, and the degree of distortion increases as successive levels of 
clusters are averaged together. Table 3 is the matrix of apparent correlations 
between objects as obtained from the dendrogram. These apparent correlations 
are sometimes referred to as cophenetic values, and if these are plotted against 
actual correlations, Figure 4, then a visual impression is obtained of the 
distortion in the dendrogram. A numerical measure of the similarity between 
the values can be calculated by computing the linear correlation between the 
two sets. If there is no distortion, then the plot would form a straight line and 
the correlation would be 1. In our example this correlation is r = 0.99. Although
such a high value for r may indicate a strong linear relationship, as Figure 4 
shows there is considerable difference between the real and apparent corre- 
lations. 
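
The comparison of true and apparent correlations can be checked with a short calculation. This is a sketch under the assumption that the fusion levels are those read from the dendrogram (0.99, 0.95, 0.97, 0.91, and -0.56); the names `true`, `fusions`, and `apparent` are introduced here for illustration only.

```python
import numpy as np
from itertools import combinations

# True correlations between the six alloy samples (upper triangle of the matrix).
true = {
    ("A", "B"): -0.62, ("A", "C"): -0.39, ("A", "D"): 0.99, ("A", "E"): 0.97,
    ("A", "F"): -0.44, ("B", "C"): 0.95, ("B", "D"): -0.68, ("B", "E"): -0.74,
    ("B", "F"): 0.94, ("C", "D"): -0.48, ("C", "E"): -0.51, ("C", "F"): 0.89,
    ("D", "E"): 0.98, ("D", "F"): -0.51, ("E", "F"): -0.61,
}

# Clusters listed from smallest to largest, with the level at which each was formed.
fusions = [("AD", 0.99), ("BC", 0.95), ("ADE", 0.97), ("BCF", 0.91),
           ("ABCDEF", -0.56)]

def apparent(p, q):
    """Level of the smallest cluster in the dendrogram containing both objects."""
    for members, level in fusions:
        if p in members and q in members:
            return level
    raise ValueError("objects never joined")

pairs = list(combinations("ABCDEF", 2))
x = np.array([true[p] for p in pairs])
y = np.array([apparent(*p) for p in pairs])

# Linear correlation between the true and the apparent (cophenetic) values.
r = np.corrcoef(x, y)[0, 1]
print(round(r, 2))
```

A value of r close to 1 here mainly reflects the wide separation between the positive and negative groups of correlations; the individual deviations within each group are still visible when the pairs are inspected one by one.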
Distance Measures


The correlation coefficient is too limiting in its definition to be of value in many 
applications of cluster analysis. It is a measure only of collinearity between
variates and takes no account of non-linear relationships or the absolute 
magnitude of variates. Instead, distance measures which can be defined 
mathematically are more commonly encountered in cluster analysis. Of course, 
it is always possible at the end of a clustering process to substitute distance with
reverse similarity; the greater the distance between objects the less their simi- 
larity. 

An object is characterized by a set of measures, and it may be represented as 
a point in multidimensional space defined by the axes, each of which corre- 
sponds to a variate. In Figure 5, a data matrix X defines measures of two 
variables on two objects A and B. Object A is characterized by the pattern
vector, $\mathbf{a} = (x_{11}, x_{12})$, and B by the pattern vector,
$\mathbf{b} = (x_{21}, x_{22})$. Using a distance
measure, objects or points closest together are assigned to the same cluster.
Numerous distance metrics have been proposed and applied in the scientific 
literature. 

For a function to be useful as a distance metric between objects, the
following basic rules apply (for objects A, B, and C):


(a) $d_{AB} \geq 0$, the distance between all pairs of measurements for objects A and
B must be non-negative,

(b) $d_{AB} = d_{BA}$, the distance measure is symmetrical and can only be zero
when A = B.

(c) $d_{AC} + d_{BC} \geq d_{AB}$, the distance between two points cannot exceed the
sum of their distances to any third point. This
statement corresponds to the familiar triangular inequality of Euclidean
geometry.


The most commonly referenced distance metric is the Euclidean distance,
defined by


    $d_{AB} = \left[ \sum_{j} (x_{Aj} - x_{Bj})^2 \right]^{1/2}$    (1)


where $x_{ij}$ is the value of the j'th variable measured on the i'th object. This
equation can be expressed in vector notation as


    $d_{AB} = \left[ \sum_{j} (a_j - b_j)^2 \right]^{1/2}$    (2)

or

    $d_{AB} = \left[ (\mathbf{a} - \mathbf{b})^T (\mathbf{a} - \mathbf{b}) \right]^{1/2}$    (3)


In general, most distance metrics conform to the general Minkowski
equation,

    $d_{AB} = \left[ \sum_{j} |x_{Aj} - x_{Bj}|^m \right]^{1/m}$    (4)


When m = 1, Equation (4) defines the city-block metric, and if m = 2 then the
Euclidean distance is defined. Figure 5 illustrates these measures on
two-dimensional data.
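
The Minkowski form translates directly into code. The following is a minimal sketch; the function name `minkowski` and the three two-dimensional points are hypothetical and chosen only so that the results are easy to check by hand.

```python
def minkowski(a, b, m=2):
    """Minkowski distance between two pattern vectors of equal length."""
    return sum(abs(x - y) ** m for x, y in zip(a, b)) ** (1.0 / m)

# Three hypothetical objects described by two variables each.
a = (1.0, 2.0)
b = (4.0, 6.0)
c = (0.0, 5.0)

city_block = minkowski(a, b, m=1)   # |1 - 4| + |2 - 6|
euclidean = minkowski(a, b, m=2)    # square root of (3^2 + 4^2)
print(city_block, euclidean)

# The basic rules for a metric, checked on these points for the Euclidean case.
print(minkowski(a, b) == minkowski(b, a))                     # symmetry
print(minkowski(a, a) == 0.0)                                 # zero for identical objects
print(minkowski(a, c) + minkowski(b, c) >= minkowski(a, b))   # triangle inequality
```

For any pair of points the city-block distance is at least as large as the Euclidean distance, since it follows the two sides of the right-angled triangle rather than the hypotenuse.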

If the variables have been measured in different units, then it may be
necessary to scale the data to make the values comparable.⁴⁻⁷ An equivalent
procedure is to compute a weighted Euclidean distance,


    $d_{AB} = \left[ \sum_{j} w_j (a_j - b_j)^2 \right]^{1/2}$    (5)


or

    $d_{AB} = \left[ (\mathbf{a} - \mathbf{b})^T \mathbf{W} (\mathbf{a} - \mathbf{b}) \right]^{1/2}$    (6)

W is a symmetric matrix, and in the simplest case W is a diagonal matrix,
the diagonal elements of which, $w_j$, are the weights or coefficients applied to the
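Figure 5 The Euclidean and city-block metrics for two-dimensional data. The data
matrix X holds object A as the row $(x_{11}, x_{12})$ and object B as the row
$(x_{21}, x_{22})$; $d_{AB}$ (Euclidean) = $d_1$ and $d_{AB}$ (city-block) = $d_2 + d_3$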


4 J. Hartigan, ‘Clustering Algorithms’, J. Wiley and Sons, New York, USA, 1975. 

5 A.A. Afifi and V. Clark, ‘Computer Aided Multivariate Analysis’, Lifetime Learning, California, 
USA, 1984. 

6 ‘Classification and Clustering’, ed. J. Van Ryzin, Academic Press, New York, USA, 1971. 

7 D.J. Hand, ‘Discrimination and Classification’, J. Wiley and Sons, Chichester, UK, 1981.
vectors corresponding to the variables in the data matrix. Weighting variables is
largely subjective and may be based on a priori knowledge regarding the data, 
such as measurement error or equivalent variance of variables. If, for example, 
weights are chosen to be inversely proportional to measurement variance, then 
variates with greater precision are weighted more heavily. However, such 
variates may contribute little to an effective clustering process. 

One weighted distance measure which does occur frequently in the scientific
literature is the Mahalanobis distance,


    $d_{AB} = \left[ (\mathbf{a} - \mathbf{b})^T \mathbf{Cov}^{-1} (\mathbf{a} - \mathbf{b}) \right]^{1/2}$    (7)


where Cov is the full variance-covariance matrix for the original data. The
Mahalanobis distance is invariant under any non-singular linear transformation
of the original variables. If several variables are highly correlated, this type of
weighting scheme down-weights their individual contributions. It should be used with
care, however. In cluster analysis, use of the Mahalanobis distance may
produce even worse results than equating the variance of each variable and may
serve only to decrease the clarity of the clusters.
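
The weighted distances can be sketched as follows. The data are randomly generated, hypothetical values, and the function names `weighted_euclidean` and `mahalanobis` and the transformation matrix `T` are assumptions made for this example. The final lines illustrate the invariance of the Mahalanobis distance when the variables are replaced by non-singular linear combinations of themselves.

```python
import numpy as np

def weighted_euclidean(a, b, w):
    """Euclidean distance with a weight w_j applied to each variable."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum(np.asarray(w, dtype=float) * diff ** 2)))

def mahalanobis(a, b, cov):
    """Distance weighted by the inverse of the variance-covariance matrix."""
    diff = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Hypothetical data: 50 objects, 3 correlated variables.
rng = np.random.default_rng(0)
mix = np.array([[1.0, 0.8, 0.0], [0.0, 1.0, 0.3], [0.0, 0.0, 1.0]])
X = rng.normal(size=(50, 3)) @ mix
cov = np.cov(X, rowvar=False)

# Diagonal weighting: weights inversely proportional to each variable's variance.
w = 1.0 / np.diag(cov)
print("weighted Euclidean:", weighted_euclidean(X[0], X[1], w))

# Full weighting: Mahalanobis distance between the first two objects.
d = mahalanobis(X[0], X[1], cov)

# Apply a non-singular linear transformation to the variables and recompute.
T = np.array([[2.0, 1.0, 0.0], [0.0, 3.0, 1.0], [1.0, 0.0, 4.0]])
Y = X @ T.T
d_t = mahalanobis(Y[0], Y[1], np.cov(Y, rowvar=False))
print("Mahalanobis before and after transformation:", d, d_t)
```

Using `np.linalg.solve` avoids forming the inverse matrix explicitly, which is a design choice of this sketch rather than anything required by Equation (7).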

Before proceeding with a more detailed examination of clustering techniques, 
we can now compare correlation and distance metrics as suitable measures of 
similarity for cluster analysis. A simple example serves to illustrate the main 
points. In Table 4, three objects (A, B, and C) are characterized by five variates. 
The correlation matrix and Euclidean distance matrix are given in Tables 5 and 


Table 4 Three samples, or objects, characterized by five measures, $x_1 \ldots x_5$

       x1     x2     x3     x4     x5
A      2.1    5.2    3.1    4.1    2.4
B      2.5    4.0    4.0    4.6    3.5
C      5.4    9.2    7.1    7.0    5.0


Table 5 The correlation matrix (a) of data from Table 4 and clustering of
objects with highest mutual correlation, (b)

(a)      A       B       C
A        1      0.69    0.96
B       0.69     1      0.63
C       0.96    0.63     1

and the first cluster is AC

(b)      AC      B
AC       1      0.66
B       0.66     1


Table 6 The Euclidean distance matrix (a) of data from Table 4 and clustering of
objects with the smallest inter-object distance, (b)

(a)      A       B       C
A        0      2.15    7.60
B       2.15     0      7.17
C       7.60    7.17     0

and the first cluster is AB

(b)      AB      C
AB       0      7.38
C       7.38     0


Figure 6 Dendrograms for the three-object data set from Table 4, clustered according to
correlation (a) and distance (b)


6 respectively, and, as before, manual clustering can be undertaken to display 
the similarity between objects. Using the correlation coefficient, objects A and
C form a mutually highly similar pair and may be joined to form a new object 
AC, with a correlation to object B formed by averaging the A to B, C to B 
correlations. The resultant dendrogram is shown in Figure 6(a). If the 
Euclidean distance matrix is used as the measure of similarity, then objects A 
and B are the most similar as they have the mutually lowest distance separating 
them. The dendrogram using Euclidean distance is illustrated in Figure 6(b). 
Different results may be obtained using different measures. The explanation 
can be appreciated by considering the original data plotted as in Figure 7. If the 
variables $x_1, x_2 \ldots x_5$ represent trace elements in, say, water samples and the
measures their individual concentrations, then samples A and B would form a
group with the differences possibly due to natural variation between samples or 
experimental error. Sample C could come from a different source. The distance 
Figure 7 The three-object data set from Table 4


metric in this case provides a suitable clustering measure. On the other hand, if 
$x_1, x_2 \ldots x_5$ denoted, say, wavelengths and the response values a measure of
absorption or emission at these wavelengths, then a different explanation may 
be sought. It is clear from Figure 7 that if the data represent spectra, then A and 
C are similar, differing only in scale or concentration, whereas spectrum B has a 
different profile. Hence, correlation provides a suitable measure of similarity. 
If, as spectra, the data had been normalized to the most intense response, then 
A and C would have been closer and the distance metric more meaningful. 

In summary, therefore, the first stage in cluster analysis is to compute the 
matrix of selected distance measures between objects. As the entire clustering 
process may depend on the choice of distance it is recommended that results 
using different functions are compared. 
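
The contrast between the two measures can be reproduced with the three objects of Table 4. In this sketch the table entries are read with one decimal place; values computed from them may differ slightly in the second decimal from the tabulated matrices, but the ordering of the pairs is what matters here.

```python
import numpy as np

names = ["A", "B", "C"]
X = np.array([
    [2.1, 5.2, 3.1, 4.1, 2.4],   # object A
    [2.5, 4.0, 4.0, 4.6, 3.5],   # object B
    [5.4, 9.2, 7.1, 7.0, 5.0],   # object C
])

# Similarity: correlation between the rows (objects), not between the variables.
corr = np.corrcoef(X)

# Dissimilarity: Euclidean distance between every pair of rows.
dist = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2))

pairs = [(0, 1), (0, 2), (1, 2)]
most_correlated = max(pairs, key=lambda p: corr[p])
closest = min(pairs, key=lambda p: dist[p])

print("first cluster by correlation:", [names[i] for i in most_correlated])
print("first cluster by distance:   ", [names[i] for i in closest])
```

The two measures select different first clusters, AC by correlation and AB by distance, which is the point made in the discussion of Figure 7.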


4 Clustering Techniques 


By grouping similar objects, clusters are themselves representative of those 
objects and form a distinct group according to some empirical rule. It is implicit 
in producing clusters that such a group can be represented further by a typical 
element of the cluster. This single element may be a genuine member of the 
cluster or a hypothetical point, for example an average of the contents' char- 
acteristics in multidimensional space. One common method of identifying a 
cluster's typical element is to substitute the mean values for the variates 
describing the objects in the cluster. The between-cluster distance can then be 
defined as the Euclidean distance, or other metric, between these means. Other 
measures not using the group means are available. The nearest-neighbour 
distance defines the distance between two closest members from different 
groups. The furthest-neighbour distance on the other hand is that between the 
most remote pair of objects in two groups. A further inter-group measure is
obtained by taking the average of all the inter-element measures between
elements in different groups. As well as defining the inter-group separation 
between clusters, each of these measures provides the basis for a clustering 
technique, defining the method by which clusters are constructed or divided. 
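
The four inter-group measures described above can be compared on a small example. The two groups of points below are hypothetical, and the variable names are introduced only for this sketch; Euclidean distance is assumed as the underlying metric.

```python
import math
from itertools import product

# Two hypothetical groups of objects in two dimensions.
group_1 = [(6, 1), (7, 1), (8, 1)]
group_2 = [(7, 3), (6, 4), (7, 4), (6, 5)]

# All distances between a member of one group and a member of the other.
between = [math.dist(p, q) for p, q in product(group_1, group_2)]

def centroid(group):
    """Mean value of each variate over the members of a group."""
    n = len(group)
    return tuple(sum(values) / n for values in zip(*group))

nearest = min(between)                       # nearest-neighbour distance
furthest = max(between)                      # furthest-neighbour distance
average = sum(between) / len(between)        # average of all inter-element distances
centroid_distance = math.dist(centroid(group_1), centroid(group_2))

print(nearest, furthest, average, centroid_distance)
```

The nearest-neighbour and furthest-neighbour values bracket the other two, which is why clustering based on them tends to produce, respectively, the most elongated and the most compact groups.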

In relatively simple cases, in which only two or three variables are measured 
for each sample, the data can usually be examined visually and any clustering 
identified by eye. As the number of variates increases, however, this is rarely 
possible and many scatter plots, between all possible pairs of variates, would 
need to be produced in order to identify major clusters, and even then clusters 
could be missed. To address this problem, many numerical clustering tech- 
niques have been developed, and the techniques themselves have been 
classified. For our purposes the methods considered belong to one of the 
following types. 

(a) Hierarchical techniques in which the elements or objects are clustered to 
form new representative objects, with the process being repeated at different 
levels to produce a tree structure, the dendrogram. 

(b) Methods employing optimization of the partitioning between clusters 
using some type of iterative algorithm, until some predefined minimum change 
in the groups is produced. 

(c) Fuzzy cluster analysis in which objects are assigned a membership function
indicating their degree of belonging to a particular group or set.⁸,⁹

In order to demonstrate the calculations and results associated with the 
different methods, the small set of bivariate data in Table 7 will be used. These 
data comprise 12 objects in two-dimensional space, Figure 8, and the positions 


Figure 8 The bivariate data from Table 7(a)¹⁰ (scatter plot of x2 against x1)


8 A. Kandel, ‘Fuzzy Mathematical Techniques with Applications’, Addison-Wesley,
Massachusetts, USA, 1986.

9 J.C. Bezdek, ‘Pattern Recognition with Fuzzy Objective Function Algorithms’, Plenum Press,
New York, USA, 1981.

10 J. Zupan, ‘Clustering of Large Data Sets’, J. Wiley and Sons, Chichester, UK, 1982.
Table 7 A simple bivariate data set for cluster analysis (a), from Zupan,¹⁰ and
the corresponding Euclidean distance matrix, (b)

(a)    A    B    C    D    E    F    G    H    I    J    K    L
x1     2    6    7    8    1    3    2    7    6    7    6    2
x2     1    1    1    1    2    2    3    3    4    4    5    6

(b)    A    B    C    D    E    F    G    H    I    J    K    L
A      0   4.0  5.0  6.0  1.4  1.4  2.0  5.4  5.0  5.8  5.7  5.0
B     4.0   0   1.0  2.0  5.1  3.2  4.5  2.2  3.0  3.2  4.0  6.4
C     5.0  1.0   0   1.0  6.1  4.1  5.4  2.0  3.2  3.0  4.1  7.1
D     6.0  2.0  1.0   0   7.1  5.1  6.3  2.2  3.6  3.2  4.5  7.8
E     1.4  5.1  6.1  7.1   0   2.0  1.4  6.1  5.4  6.3  5.8  4.1
F     1.4  3.2  4.1  5.1  2.0   0   1.4  4.1  3.6  4.5  4.2  4.1
G     2.0  4.5  5.4  6.3  1.4  1.4   0   5.0  4.1  5.1  4.5  3.0
H     5.4  2.2  2.0  2.2  6.1  4.1  5.0   0   1.4  1.0  2.2  5.8
I     5.0  3.0  3.2  3.6  5.4  3.6  4.1  1.4   0   1.0  1.0  4.5
J     5.8  3.2  3.0  3.2  6.3  4.5  5.1  1.0  1.0   0   1.4  5.4
K     5.7  4.0  4.1  4.5  5.8  4.2  4.5  2.2  1.0  1.4   0   4.1
L     5.0  6.4  7.1  7.8  4.1  4.1  3.0  5.8  4.5  5.4  4.1   0


of the points are representative of different shaped clusters, the single point (L), 
the extended group (B,C,D), the symmetrical group (A,E,F,G), and the
asymmetrical cluster (H,I,J,K).¹⁰


Hierarchical Techniques 


When employing hierarchical clustering techniques, the original data are separ- 
ated into a few general classes, each of which is further subdivided into still 
smaller groups until finally the individual objects themselves remain. Such 
methods may be agglomerative or divisive. By agglomerative clustering, small 
groups, starting with individual samples, are fused to produce larger groups as 
in the examples studied previously. In contrast, divisive clustering starts with a 
single cluster, containing all samples, which is successively divided into smaller 
partitions. Hierarchical techniques are very popular, not least because their 
application leads to the production of a dendrogram which can provide a 
two-dimensional pictorial representation of the clustering process and the 
results. Agglomerative hierarchical clustering is very common and we will 
proceed with details of its application. 

Agglomerative methods begin with the computation of a similarity or distance
matrix between the objects, and result in a dendrogram illustrating the
successive fusion of objects and groups until all objects are fused into one
large set. Agglomerative methods are the most common hierarchical schemes
found in the scientific literature. The entire process of agglomerative
clustering using distance measures can be summarized by a four-step algorithm.

Step 1. Calculate the between-object distance matrix.

Step 2. Find the smallest element in the distance matrix and join the
        corresponding objects into a single cluster.

Step 3. Calculate a new distance matrix, taking into account that clusters
        produced in the second step will have formed new objects and taken
        the place of original data points.

Step 4. Return to Step 2, or stop if the final two clusters have been fused
        into the final, single cluster.


The wide range of agglomerative methods available differ principally in the 
implementation of Step 3 and the calculation of the distance between two 
clusters. The different between-group distance measures can be defined in terms 
of the general formula 


$$d_{k(i,j)} = \alpha_i d_{ki} + \alpha_j d_{kj} + \beta d_{ij} + \gamma |d_{ki} - d_{kj}| \qquad (8)$$


where $d_{ij}$ is the distance between objects i and j and $d_{k(i,j)}$ is the
distance between group k and a new group (i,j) formed by the fusion of groups
i and j. The values of coefficients $\alpha_i$, $\alpha_j$, $\beta$, and
$\gamma$ are chosen to select the specific between-group metric to be used.
Table 8 lists the more common metrics and the corresponding values for
$\alpha_i$, $\alpha_j$, $\beta$, and $\gamma$.

The use of Equation (8) makes it a simple matter for standard computer 
software packages to offer a choice of distance measures to be investigated by 
selecting the appropriate values of the coefficients. 
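
As a rough illustration of how a program might offer this choice, the sketch below evaluates Equation (8) for the coefficient sets of Table 8. The function names and the method labels ("nearest", "ward", and so on) are assumptions made for this example. The sample distances used (D to B = 2, D to C = 1, B to C = 1) are read from Table 7(b).

```python
def coefficients(method, n_i=1, n_j=1, n_k=1):
    """Return (alpha_i, alpha_j, beta, gamma) for a metric listed in Table 8.

    n_i, n_j, n_k are the numbers of objects in groups i, j, and k.
    """
    if method == "nearest":          # single linkage
        return 0.5, 0.5, 0.0, -0.5
    if method == "furthest":         # complete linkage
        return 0.5, 0.5, 0.0, 0.5
    if method == "median":
        return 0.5, 0.5, -0.25, 0.0
    if method == "centroid":
        a_i, a_j = n_i / (n_i + n_j), n_j / (n_i + n_j)
        return a_i, a_j, -a_i * a_j, 0.0
    if method == "average":          # group average
        return n_i / (n_i + n_j), n_j / (n_i + n_j), 0.0, 0.0
    if method == "ward":
        total = n_k + n_i + n_j
        return (n_k + n_i) / total, (n_k + n_j) / total, -n_k / total, 0.0
    raise ValueError(f"unknown method: {method}")


def updated_distance(d_ki, d_kj, d_ij, alpha_i, alpha_j, beta, gamma):
    """Equation (8): distance from group k to the fused group (i, j)."""
    return (alpha_i * d_ki + alpha_j * d_kj + beta * d_ij
            + gamma * abs(d_ki - d_kj))


# Distance from object D to the fused group BC, using Table 7(b).
d_db, d_dc, d_bc = 2.0, 1.0, 1.0
nearest = updated_distance(d_db, d_dc, d_bc, *coefficients("nearest"))
furthest = updated_distance(d_db, d_dc, d_bc, *coefficients("furthest"))
ward = updated_distance(d_db, d_dc, d_bc, *coefficients("ward"))
print(nearest, furthest, round(ward, 2))
```

Keeping the coefficients in one lookup function means a new metric only requires a new branch; the update formula itself never changes.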


Table 8 The common distance measures used in cluster analysis


                                                  Coefficients

Metric               α_i                      α_j                      β                     γ
Nearest neighbour    0.5                      0.5                      0                     -0.5
(single linkage)
Furthest neighbour   0.5                      0.5                      0                     0.5
(complete linkage)
Centroid             n_i/(n_i+n_j)            n_j/(n_i+n_j)            -α_i α_j              0
Median               0.5                      0.5                      -0.25                 0
Group average        n_i/(n_i+n_j)            n_j/(n_i+n_j)            0                     0
Ward's method        (n_k+n_i)/(n_k+n_i+n_j)  (n_k+n_j)/(n_k+n_i+n_j)  -n_k/(n_k+n_i+n_j)    0


The number of objects in any cluster i is n_i.

Figure 9 Dendrogram of the data from Table 7(a) using the nearest-neighbour algorithm


For the nearest-neighbour method of producing clusters, Equation (8) 
reduces to 


$$d_{k(i,j)} = 0.5 d_{ki} + 0.5 d_{kj} - 0.5 |d_{ki} - d_{kj}| \qquad (9)$$


From the 12 x 12 distance matrix, Table 7(b), objects B and C are among the
closest pairs (1 unit apart) and form a new, combined object. The distance
from BC to each original object is calculated according to Equation (9).
Thus, for A to BC,


$$\begin{aligned}
d_{A(BC)} &= 0.5 d_{AB} + 0.5 d_{AC} - 0.5 |d_{AB} - d_{AC}| \\
          &= 0.5(4) + 0.5(5) - 0.5(1) \\
          &= 4
\end{aligned} \qquad (10)$$


In fact, for the nearest-neighbour algorithm, Equation (9) can be rewritten 
as 


$$d_{k(i,j)} = \min(d_{ki}, d_{kj}) \qquad (11)$$


i.e. the distance between a cluster and an object is the smallest of the distances 
between the elements in the cluster and the object. 
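
The step from Equation (9) to Equation (11) follows from considering which of the two distances is the smaller. As a short derivation, write $a = d_{ki}$ and $b = d_{kj}$ (shorthand introduced here for brevity, not used in the original text):

$$\begin{aligned}
a \le b: &\quad 0.5a + 0.5b - 0.5(b - a) = a \\
a > b:   &\quad 0.5a + 0.5b - 0.5(a - b) = b
\end{aligned}$$

In both cases the result is $\min(a, b)$. Reversing the sign of the last term, as in the furthest-neighbour coefficients of Table 8, gives $\max(a, b)$ by the same argument.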

The distance between the new object BC and each remaining original object 
is calculated, and the procedure repeated with the resulting 11 x 11 distance 
matrix until a single cluster containing all objects is produced. The resulting 
dendrogram is illustrated in Figure 9. 
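
The whole nearest-neighbour procedure can be sketched in a few lines of Python. This is a simple illustration under stated assumptions rather than an efficient implementation: it recomputes the between-group distance as the minimum over all member pairs, which is what repeated application of Equation (11) amounts to, and the names `POINTS` and `single_linkage` are chosen here for the example. Where several pairs are equally close, the order in which they are fused is arbitrary.

```python
import math

# Coordinates (x1, x2) of the 12 objects in Table 7(a).
POINTS = {
    "A": (2, 1), "B": (6, 1), "C": (7, 1), "D": (8, 1),
    "E": (1, 2), "F": (3, 2), "G": (2, 3), "H": (7, 3),
    "I": (6, 4), "J": (7, 4), "K": (6, 5), "L": (2, 6),
}


def single_linkage(points):
    """Fuse the two closest groups repeatedly; return the list of fusions.

    Each fusion is recorded as (members of group 1, members of group 2,
    distance at which the two groups were joined).
    """
    clusters = [frozenset([name]) for name in points]
    merges = []

    def group_distance(c1, c2):
        # Nearest-neighbour rule: smallest distance between any two members.
        return min(math.dist(points[a], points[b]) for a in c1 for b in c2)

    while len(clusters) > 1:
        # Step 2: find the closest pair of groups.
        candidates = [(group_distance(c1, c2), i, j)
                      for i, c1 in enumerate(clusters)
                      for j, c2 in enumerate(clusters) if i < j]
        dist, i, j = min(candidates)
        merges.append((sorted(clusters[i]), sorted(clusters[j]), dist))
        # Step 3: the fused group replaces its two parents.
        fused = clusters[i] | clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(fused)
    return merges


merges = single_linkage(POINTS)
for left, right, dist in merges:
    print(f"{''.join(left):>8} + {''.join(right):<8} at {dist:.2f}")
```

The recorded fusion distances are the heights at which the branches of a dendrogram such as Figure 9 would join.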

The dendrogram for the furthest-neighbour, or complete linkage, technique is 
produced in a similar manner. In this case, Equation (8) becomes 


$$d_{k(i,j)} = 0.5 d_{ki} + 0.5 d_{kj} + 0.5 |d_{ki} - d_{kj}| \qquad (12)$$

and this implies that


$$d_{k(i,j)} = \max(d_{ki}, d_{kj}) \qquad (13)$$


i.e. the distance between a cluster and an object is the maximum of the distances 
between cluster elements and the object. 

For example, for group BC to object D, the B to D distance is 2 units and the
C to D distance is 1 unit. From Equation (13), therefore, $d_{D(BC)} = 2$, or


$$\begin{aligned}
d_{D(BC)} &= 0.5 d_{DB} + 0.5 d_{DC} + 0.5 |d_{DB} - d_{DC}| \\
          &= 0.5(2) + 0.5(1) + 0.5(1) \\
          &= 2
\end{aligned} \qquad (14)$$


The complete dendrogram is shown in Figure 10. The nearest-neighbour and 
furthest-neighbour criteria are the simplest algorithms to implement. 

Another procedure, Ward's method, is commonly encountered in chemo- 
metrics. A centroid point is calculated for all combinations of two clusters and 
the distance between this point and all other objects calculated. In practice this 
technique generally favours the production of small clusters. 

From Equation (8), with the Ward coefficients of Table 8 and single objects
B, C, and D ($n_i = n_j = n_k = 1$), the distance from D to the group BC is


$$\begin{aligned}
d_{D(BC)} &= 2d_{DB}/3 + 2d_{DC}/3 - d_{BC}/3 \\
          &= 2(2)/3 + 2(1)/3 - 1(1)/3 \\
          &= 1.67
\end{aligned} \qquad (15)$$

The process is repeated between BC and other objects, and the iteration
starts again with the new distance matrix until a single cluster is produced. The
dendrogram from applying Ward's method is illustrated in Figure 11.

Figure 10 Dendrogram of the data from Table 7(a) using the furthest-neighbour
algorithm

Figure 11 Dendrogram of the data from Table 7(a) using Ward's method


The different methods available from applying Equation (8) with the
coefficients from Table 8 each produce their own style of dendrogram with their
own merits and disadvantages. Which technique or method is best is largely
governed by experience and empirical tests. The construction of the
dendrogram invariably induces considerable distortion as discussed, and other,
non-hierarchical, methods are generally favoured when large data sets are to be
analysed.


The K-Means Algorithm 


One of the most popular and widely used clustering techniques is the
application of the K-means algorithm. It is available with all popular cluster
analysis software packages and can be applied to relatively large sets of data.
The objective of the method is to partition the m objects, characterized by n
variables, into K clusters so that the within-cluster sum of squared
distances is minimized. Being an optimization-based technique, the number of
possible solutions cannot be predicted and the best possible partitioning of the
objects may not be achieved. In practice, the method finds a local optimum,
defined as being a classification in which no movement of an observation from
one cluster to another will reduce the within-cluster sum of squares.

Many versions of the algorithm exist, but in most cases the user is expected to 
supply the number of clusters, K, expected. The algorithm described here is that 
proposed by Hartigan.* 

The data matrix is defined by X with elements $x_{ij}$ ($1 \le i \le m$,
$1 \le j \le n$), where m is the number of objects and n is the number of
variables used to characterize the objects. The cluster analysis seeks to find
K partitions or clusters, with each object residing in only one of the clusters.

The mean value for each variable j, for all objects in cluster L, is denoted by
$B_{Lj}$ ($1 \le L \le K$). The number of objects residing in cluster L is $R_L$.

The distance, $D_{iL}$, between the i'th object and the centre or average of each
cluster is given by the Euclidean metric,


$$D_{iL} = \left[ \sum_{j=1}^{n} (x_{ij} - B_{Lj})^2 \right]^{1/2} \qquad (16)$$

and the error associated with any partition is

$$e = \sum_{i=1}^{m} \left( D_{i,L(i)} \right)^2 \qquad (17)$$


where L(i) is the cluster containing the i'th object. Thus e represents the sum
of the squares of the distances between each object and the centre of its own
cluster.

The algorithm proceeds by moving an object from one cluster to another in 
order to reduce e, and ends when no movement can reduce e. The steps involved 
are: 


Step 1: Given K clusters and their initial contents, calculate the cluster
        means $B_{Lj}$ and the initial partition error, e.

Step 2: For the first object, compute the increase in error, $\Delta e$,
        obtained by transferring the object from its current cluster, L(1),
        to every other cluster L, $2 \le L \le K$:


$$\Delta e = \frac{R_L D_{1,L}^2}{R_L + 1} - \frac{R_{L(1)} D_{1,L(1)}^2}{R_{L(1)} - 1} \qquad (18)$$


        If this value for $\Delta e$ is negative, i.e. the move would reduce
        the partition error, transfer the object and adjust the cluster means
        taking account of their new populations.

Step 3: Repeat Step 2 for every object. 

Step 4: If no object has been moved then stop, else return to Step 2. 
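
The four steps can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: cluster labels are 0-based, each object goes to the cluster giving the most negative change in error from Equation (18), and an object that is alone in its cluster is never moved (an assumption added here so that no cluster becomes empty and the denominator $R_{L(i)} - 1$ is never zero). The names `POINTS`, `INITIAL`, `centre`, `kmeans_transfer`, and `partition_error` are illustrative. The data are those of Table 7(a), with the starting partition used in this section's worked example; the order in which objects are transferred by a strict implementation need not match the hand-worked passes.

```python
POINTS = [(2, 1), (6, 1), (7, 1), (8, 1), (1, 2), (3, 2),
          (2, 3), (7, 3), (6, 4), (7, 4), (6, 5), (2, 6)]   # objects A..L
INITIAL = [0, 1, 1, 2, 0, 0, 0, 2, 2, 3, 3, 1]              # starting clusters, 0-based


def centre(points, labels, c):
    """Return (mean vector, population) of cluster c; assumes it is not empty."""
    members = [p for p, lab in zip(points, labels) if lab == c]
    return [sum(col) / len(members) for col in zip(*members)], len(members)


def sqdist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_transfer(points, labels, k):
    """Move objects between clusters until no move reduces the error."""
    labels = list(labels)
    moved = True
    while moved:                                  # Step 4: repeat passes
        moved = False
        for i, p in enumerate(points):            # Steps 2 and 3: each object
            cur = labels[i]
            cur_centre, cur_n = centre(points, labels, cur)
            if cur_n == 1:
                continue                          # never empty a cluster
            removal = cur_n * sqdist(p, cur_centre) / (cur_n - 1)
            best, best_delta = cur, 0.0
            for c in range(k):
                if c == cur:
                    continue
                c_centre, c_n = centre(points, labels, c)
                delta = c_n * sqdist(p, c_centre) / (c_n + 1) - removal
                if delta < best_delta - 1e-12:    # Equation (18) is negative
                    best, best_delta = c, delta
            if best != cur:
                labels[i] = best                  # means are recomputed on demand
                moved = True
    return labels


def partition_error(points, labels):
    """Equation (17): sum of squared distances to each object's own centre."""
    return sum(sqdist(p, centre(points, labels, lab)[0])
               for p, lab in zip(points, labels))


final = kmeans_transfer(POINTS, INITIAL, 4)
print(final, round(partition_error(POINTS, final), 2))
```

Recomputing the cluster means on demand keeps the sketch short; a practical program would update them incrementally after each transfer.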


Applying the algorithm manually to our test data will illustrate its opera- 
tion. 

Using the data from Table 7, it is necessary first to specify the number of 
clusters into which the objects are to be partitioned. We will use K = 4. Before 
the algorithm is implemented we also need to assign each object to an initial 
cluster. A number of methods are available, and that used here is to assign 
object i to cluster L(i) according to 


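$$L(i) = \mathrm{INT}\left[ (K - 1) \frac{\sum_j x_{ij} - \mathrm{MIN} \sum_j x_{ij}}{\mathrm{MAX} \sum_j x_{ij} - \mathrm{MIN} \sum_j x_{ij}} \right] + 1 \qquad (19)$$


where $\sum_j x_{ij}$ is the sum of all variables for each object, and MIN and
MAX denote the minimum and maximum sum values.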
For the test data,

                               Objects
Variables    A   B   C   D   E   F   G   H   I   J   K   L
x1           2   6   7   8   1   3   2   7   6   7   6   2
x2           1   1   1   1   2   2   3   3   4   4   5   6
Σx_ij        3   7   8   9   3   5   5  10  10  11  11   8

MAX Σx_ij = 11, MIN Σx_ij = 3

For object A,

$$L(A) = \mathrm{INT}[(4 - 1)(3 - 3)/(11 - 3)] + 1 = 1 \qquad (20)$$

and similarly for each object, all i,

i       A   B   C   D   E   F   G   H   I   J   K   L
L(i)    1   2   2   3   1   1   1   3   3   4   4   2


Thus, objects A, E, F, G are assigned initially to Cluster 1, B, C, and L to 
Cluster 2, D, H, and I to Cluster 3, and finally J and K to Cluster 4. 
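
The initial assignment rule of Equation (19) is easily checked by computer. A minimal Python sketch follows; the names `TABLE_7` and `initial_clusters` are illustrative, and the function assumes that the objects do not all share the same variable sum (otherwise the denominator would be zero).

```python
# Table 7(a): (x1, x2) for objects A..L.
TABLE_7 = [(2, 1), (6, 1), (7, 1), (8, 1), (1, 2), (3, 2),
           (2, 3), (7, 3), (6, 4), (7, 4), (6, 5), (2, 6)]


def initial_clusters(data, k):
    """Equation (19): assign each object to a cluster 1..k by its variable sum."""
    sums = [sum(row) for row in data]
    low, high = min(sums), max(sums)
    # int() truncates towards zero, matching INT for these non-negative values.
    return [int((k - 1) * (s - low) / (high - low)) + 1 for s in sums]


labels = initial_clusters(TABLE_7, 4)
print(dict(zip("ABCDEFGHIJKL", labels)))
```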

The centres of each of these clusters can now be calculated. For Cluster 1, 
L=1, 


$$B_{11} = (2 + 1 + 3 + 2)/4 = 2.00 \qquad (21)$$
$$B_{12} = (1 + 2 + 2 + 3)/4 = 2.00 \qquad (22)$$


Similarly for each of the remaining three clusters. The initial clusters and 
their mean values are therefore, 


Cluster Contents Cluster means 
X1 X2 

1 AEFG 2.00 2.00 

2 BCL 5.00 2.67 

3 DHI 7.00 2.67 

4 JK 6.50 4.50 


This initial partitioning is illustrated in Figure 12(a). 
By application of Equations (16) and (17), the error associated with this 
initial partitioning is 


$$\begin{aligned}
e ={} & (2 - 2)^2 + (1 - 2)^2 + (6 - 5)^2 + (1 - 2.67)^2 + (7 - 5)^2 \\
      & + (1 - 2.67)^2 + (8 - 7)^2 + (1 - 2.67)^2 + (1 - 2)^2 + (2 - 2)^2 \\
      & + (3 - 2)^2 + (2 - 2)^2 + (2 - 2)^2 + (3 - 2)^2 + (7 - 7)^2 \\
      & + (3 - 2.67)^2 + (6 - 7)^2 + (4 - 2.67)^2 + (7 - 6.5)^2 \\
      & + (4 - 4.5)^2 + (6 - 6.5)^2 + (5 - 4.5)^2 + (2 - 5)^2 + (6 - 2.67)^2 \\
  ={} & 42.35
\end{aligned} \qquad (23)$$
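
The partition error of Equation (17) can be evaluated for any assignment with a short routine such as the one sketched below (the names `POINTS`, `INITIAL`, `cluster_means`, and `partition_error` are illustrative). Because the sketch uses unrounded cluster means, it should give approximately 42.33 for the initial partition; the small difference from the 42.35 quoted above is presumably a matter of rounding in the hand calculation.

```python
POINTS = [(2, 1), (6, 1), (7, 1), (8, 1), (1, 2), (3, 2),
          (2, 3), (7, 3), (6, 4), (7, 4), (6, 5), (2, 6)]   # objects A..L
INITIAL = [1, 2, 2, 3, 1, 1, 1, 3, 3, 4, 4, 2]              # initial clusters L(i)


def cluster_means(points, labels):
    """Return {cluster: tuple of variable means} for the clusters in use."""
    means = {}
    for c in set(labels):
        members = [p for p, lab in zip(points, labels) if lab == c]
        means[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return means


def partition_error(points, labels):
    """Equation (17): sum of squared distances to each object's own centre."""
    means = cluster_means(points, labels)
    return sum(sum((x - b) ** 2 for x, b in zip(p, means[lab]))
               for p, lab in zip(points, labels))


e0 = partition_error(POINTS, INITIAL)
print(cluster_means(POINTS, INITIAL))
print(round(e0, 2))
```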


Figure 12 The K-means algorithm applied to the test data from Table 7(a), showing the
initial four partitions (a) and subsequent steps, (b) to (d), until a stable result is
achieved (e)

In order to reduce this error, the algorithm proceeds to examine each object
in turn and calculate the effect of transferring an object to a new cluster. 
For the first object, A, its squared Euclidean distance to each cluster centre is 


$$\begin{aligned}
D_{A,1}^2 &= (2.00 - 2.00)^2 + (1.00 - 2.00)^2 = 1.00 \\
D_{A,2}^2 &= (2.00 - 5.00)^2 + (1.00 - 2.67)^2 = 11.79 \\
D_{A,3}^2 &= (2.00 - 7.00)^2 + (1.00 - 2.67)^2 = 27.79 \\
D_{A,4}^2 &= (2.00 - 6.50)^2 + (1.00 - 4.50)^2 = 32.50
\end{aligned} \qquad (24)$$


If we were to transfer object A from Cluster 1 to Cluster 2, then the change in 
error, from Equation (18), is 


$$\Delta e = (3)(11.79)/4 - (4)(1)/3 = 7.51 \qquad (25)$$
and to Cluster 3,

$$\Delta e = (3)(27.79)/4 - (4)(1)/3 = 19.51 \qquad (26)$$
and to Cluster 4, which contains two objects,

$$\Delta e = (2)(32.50)/3 - (4)(1)/3 = 20.33 \qquad (27)$$


The Ae values are all positive and each proposed change would serve to 
increase the partition error, so object A is not moved from Cluster 1. This result 
can be appreciated by reference to Figure 12(a). Object A is closest to the centre 
of Cluster 1 and nothing would be gained by assigning it to another cluster. 
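
The arithmetic of Equation (18) for object A can be checked with a few lines of Python. This is only a sketch: the function name `delta_e` and its argument names are chosen here, and the squared distances and cluster populations are those quoted in the text (Cluster 1 holds four objects, Clusters 2 and 3 hold three each, and Cluster 4 holds two).

```python
def delta_e(d2_new, n_new, d2_current, n_current):
    """Equation (18): change in partition error when one object is moved.

    d2_new, n_new         -- squared distance to, and population of, the target cluster
    d2_current, n_current -- the same quantities for the object's present cluster
    """
    return n_new * d2_new / (n_new + 1) - n_current * d2_current / (n_current - 1)


# Object A currently sits in Cluster 1 (population 4, squared distance 1.00).
to_cluster_2 = delta_e(11.79, 3, 1.00, 4)
to_cluster_3 = delta_e(27.79, 3, 1.00, 4)
to_cluster_4 = delta_e(32.50, 2, 1.00, 4)
print(round(to_cluster_2, 2), round(to_cluster_3, 2), round(to_cluster_4, 2))
```

All three values are positive, so none of the moves would be accepted.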

The algorithm continues by checking each object and calculating Ae for each 
object with each cluster. For our purpose, visual examination of Figure 12(a) 
indicates that no change would be expected for object B, but for object C a 
move is likely as it is closer to the centre of Cluster 3 than Cluster 2. 

Moving object C, the third object, to Cluster 1,


$$D_{C,1}^2 = (7.00 - 2.00)^2 + (1.00 - 2.00)^2 = 26.00$$
and

$$\Delta e = (4)(26.00)/5 - (3)(6.79)/2 = 10.62 \qquad (28)$$
for Cluster 2, its current group,

$$D_{C,2}^2 = (7.00 - 5.00)^2 + (1.00 - 2.67)^2 = 6.79 \qquad (29)$$
and to Cluster 3,


$$D_{C,3}^2 = (7.00 - 7.00)^2 + (1.00 - 2.67)^2 = 2.79$$
and

$$\Delta e = (3)(2.79)/4 - (3)(6.79)/2 = -8.09 \qquad (30)$$
and to Cluster 4,

$$D_{C,4}^2 = (7.00 - 6.50)^2 + (1.00 - 4.50)^2 = 12.50$$
and

$$\Delta e = (2)(12.50)/3 - (3)(6.79)/2 = -1.85 \qquad (31)$$

So, moving object C from Cluster 2 to Cluster 3 gives the largest decrease
in e, 8.09, and the new value of e is (42.35 - 8.09) = 34.26. The partition
is therefore changed.
With new clusters and contents we must calculate their new mean values: 


Cluster Contents (1st change) Cluster means 
X1 X2 

1 AEFG 2.00 2.00 

2 BL 4.00 3.50 

3 CDHI 7.00 2.25 

4 JK 6.50 4.50 


The new partition, after the first pass through the algorithm, is illustrated in 
Figure 12(b). On the second run through the algorithm object B will transfer to 
Cluster 3; it is nearer its mean than Cluster 2. 

On the second pass, therefore, the cluster populations and their centres are, 
Figure 12(c), 


Cluster Contents (2nd change) Cluster means 
X1 X2 

1 AEFG 2.00 2.00 

2 L 2.00 6.00 

3 BCDHI 6.80 2.00 

4 JK 6.50 4.50 


On the next pass, object I will move to Cluster 4, Figure 12(d), 


Cluster Contents (3rd change) Cluster means 
X1 X2 

1 AEFG 2.00 2.00 

2 L 2.00 6.00 

3 BCDH 7.00 1.50 

4 IJK 6.33 4.33 


On the fourth pass, object H moves from Cluster 3 to Cluster 4, Figure 12(e),


Cluster Contents (4th change) Cluster means 
X1 X2 

1 AEFG 2.00 2.00 

2 L 2.00 6.00 

3 BCD 7.00 1.00 

4 HIJK 6.50 4.00 


The process is repeated once more but this time no movement of any object 
between clusters gives a better solution in terms of reducing the value of e. So 
Figure 12(e) represents the best result. 

Our initial assumption when applying the K-means algorithm was that four 
clusters were known to exist. Visual examination of the data suggests that this 
assumption is reasonable in this case, but other values could be acceptable 
depending on the model investigated. For K = 2 and K = 3, the K-means
algorithm produces the results illustrated in Figure 13. Although statistical tests
have been proposed in order to select the best number of partitions, cluster
analysis is not generally considered a statistical technique, and the choice of
criteria for best results is at the discretion of the user.

Figure 13 The K-means algorithm applied to the test data from Table 7(a) assuming
there are two clusters (a), and three clusters (b)


Fuzzy Clustering 


The principal aim of performing a cluster analysis is to permit the identification 
of similar samples according to their measured properties. Hierarchical tech- 
niques, as we have seen, achieve this by linking objects according to some 
formal rule set. The K-means method on the other hand seeks to partition the 
pattern space containing the objects into an optimal predefined number of

Table 9 Bivariate data ($x_1$ and $x_2$) measured on 15 objects, A ... O


Object, i    Variable, j       Object, i    Variable, j
             x1     x2                      x1     x2
A            1      1          I            5      3
B            1      3          J            6      2
C            1      5          K            6      3
D            2      2          L            6      4
E            2      3          M            7      1
F            2      4          N            7      3
G            3      3          O            7      5
H            4      3


sections. In the process of providing a simplified representation of the data,
both schemes can distort the 'true' picture. By linking similar objects and 
reducing the data to a two-dimensional histogram, hierarchical clustering often 
severely distorts the similarity value by averaging values or selecting maximum 
or minimum values. The result of K-means clustering is a simple list of clusters, 
their centres, and their contents. Nothing is said about how well any specific 
object fits into its chosen cluster, or how close it may be to other clusters. In 
Figure 12(e), for example, object C is more representative of its parent cluster 
than, say, object B which may be considered to have some of the characteristics 
of the first cluster containing objects A, E, F, G. 

One clustering method which seeks to not only highlight similar objects but
also provide information regarding the relationship of each object to each
cluster is that of fuzzy clustering.⁸,⁹ Generally referred to in the literature as the
fuzzy c-means method, to preserve continuity of the symbols used thus far, we
will identify the technique as fuzzy k-means clustering. The method illustrated
here is based on Bezdek's algorithm.⁹

In order to demonstrate the use and application of fuzzy clustering, a simple
set of data will be analysed manually. The data in Table 9 represent 15 objects
(A ... O) characterized by two variables $x_1$ and $x_2$, and these data are
plotted in the scatter diagram of Figure 14. It is perhaps not unreasonable to
assume that these data represent two classes or clusters. The means of the
clusters are well separated but the clusters touch about points G, H, and I.
Because the apparent groups are not well separated, the results using
conventional cluster analysis schemes can be misleading or ambiguous. With the
data from Table 9 and applying the K-means algorithm using two different
commercially available software packages, the results are as illustrated in
Figure 15(a) and 15(b). These results are confusing. Since the data are
symmetrical, in the $x_2$ axis, about $x_2 = 3$, why should points B, E, G, H,
I, K, N belong to one cluster rather than the other cluster? Similarly, in
Figure 15(b), in the $x_1$ axis the data are symmetrical about $x_1 = 4$ and
there is no reason why object H should belong exclusively to either cluster.
The problem arises because of the crisp nature of the clustering rule that
assigns each object to one specific cluster. This rule is

Figure 15 Clustering resulting from application of two commercial programs of the
K-means algorithm (a) and (b), to the data from Table 9


relaxed when applying fuzzy clustering and objects are recognized as belonging, 
to a lesser or greater degree, to every cluster. 

The degree or extent to which an object, i, belongs to a specific cluster, k, is
referred to as that object's membership function, denoted $\mu_{ki}$. Thus, visual
inspection of Figure 14 would suggest that for two clusters objects E and K
would be close to the cluster centres, i.e. $\mu_{1E} \approx 1$ and
$\mu_{2K} \approx 1$, and that object H would belong equally to both clusters,
i.e. $\mu_{1H} = 0.5$ and $\mu_{2H} = 0.5$. This is precisely the result
obtained with fuzzy clustering.

As with K-means clustering, the fuzzy k-means technique is iterative and
seeks to minimize the within-cluster sum of squares. Our data matrix is defined
by the elements $x_{ij}$ and we seek K clusters, not by hard partitioning of the
variable space, but by fuzzy partitions, each of which has a cluster centre or
prototype value, $B_{kj}$ ($1 \le k \le K$).

The algorithm starts with a pre-selected number of clusters, K. In addition,
an initial fuzzy partition of the objects is supplied such that there are no empty
clusters and the membership functions for an object with respect to each cluster
sum to unity,


$$\mu_{1i} + \mu_{2i} + \dots + \mu_{Ki} = 1 \qquad (32)$$


Thus, for two clusters, if $\mu_{1i} = 0.8$, then $\mu_{2i} = 0.2$.
The algorithm proceeds by calculating the K weighted means in order to
determine cluster centres,


$$B_{kj} = \sum_{i=1}^{m} (\mu_{ki})^2 x_{ij} \Big/ \sum_{i=1}^{m} (\mu_{ki})^2 \qquad (33)$$


New fuzzy partitions are then defined by a new set of membership functions 
given by, 


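$$\mu_{ki} = \frac{1 \Big/ \sum_{j=1}^{n} (x_{ij} - B_{kj})^2}{\sum_{l=1}^{K} \left[ 1 \Big/ \sum_{j=1}^{n} (x_{ij} - B_{lj})^2 \right]} \qquad (34)$$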


i.e. the ratio of the inverse squared distance of object i from the k’th cluster 
centre to the sum of the inverse squared distances of object i to all cluster 
centres. 

From this new partitioning, new cluster centres are calculated by applying 
Equation (33), and the process repeats until the total change in values of the 
membership functions is less than some preselected value, or a set number of 
iterations has been achieved. 

Application of the algorithm can be demonstrated using the data from Table 
9. 

With K = 2, our first step is to assign membership functions for each object
and each cluster. This process can be done in a random fashion, bearing in mind
the constraint imposed by Equation (32), or using prior knowledge, e.g. the
output from crisp clustering methods. With the results from the K-means
algorithm, Figure 15(b), the membership functions can be assigned as shown in
Table 10. Objects A ... H belong predominantly to Cluster 1 and objects I ...
O to Cluster 2.

Table 10 Initial membership functions, $\mu_{ki}$, for 15 objects assuming two clusters


i     μ1i    μ2i        i     μ1i    μ2i
A     0.9    0.1        I     0.1    0.9
B     0.9    0.1        J     0.1    0.9
C     0.9    0.1        K     0.1    0.9
D     0.9    0.1        L     0.1    0.9
E     0.9    0.1        M     0.1    0.9
F     0.9    0.1        N     0.1    0.9
G     0.9    0.1        O     0.1    0.9
H     0.9    0.1


Using this initial fuzzy partition, the initial cluster centres can be calculated 
according to Equation (33). 


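$$\begin{aligned}
B_{11} ={} & \sum_{i=1}^{15} (\mu_{1i})^2 x_{i1} \Big/ \sum_{i=1}^{15} (\mu_{1i})^2 \\
       ={} & [(0.9^2)1 + (0.9^2)1 + (0.9^2)1 + (0.9^2)2 + (0.9^2)2 + (0.9^2)2 \\
           & + (0.9^2)3 + (0.9^2)4 + (0.1^2)5 + (0.1^2)6 + (0.1^2)6 + (0.1^2)6 \\
           & + (0.1^2)7 + (0.1^2)7 + (0.1^2)7] / [8(0.9^2) + 7(0.1^2)] \\
       ={} & 2.04
\end{aligned} \qquad (35)$$


Similarly for $B_{12}$, $B_{21}$, $B_{22}$, and the centres are


$$B_{11} = 2.04, \quad B_{12} = 3.00$$
$$B_{21} = 6.20, \quad B_{22} = 3.00$$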


And we can proceed to calculate new membership functions for each object 
about these centres. 

The squared Euclidean distance between object A and the centre of Cluster 1 
is, from Equation (1), 


$$\begin{aligned}
d_{A(1)}^2 &= \sum_{j=1}^{2} (x_{Aj} - B_{1j})^2 \\
           &= (1 - 2.04)^2 + (1 - 3.00)^2 \\
           &= 5.08
\end{aligned} \qquad (37)$$


and to Cluster 2, $d_{A(2)}^2 = 31.04$. The new membership functions for object A,
from Equation (34), are therefore,


$$\mu_{1A} = \frac{1/5.08}{(1/5.08) + (1/31.04)} = 0.86 \qquad (38)$$


and

Table 11 Final membership functions, $\mu_{ki}$, for 15 objects assuming two clusters


i     μ1i     μ2i        i     μ1i     μ2i
A     0.86    0.14       I     0.12    0.88
B     0.97    0.03       J     0.06    0.94
C     0.86    0.14       K     0.01    0.99
D     0.94    0.06       L     0.06    0.94
E     0.99    0.01       M     0.14    0.86
F     0.94    0.06       N     0.03    0.97
G     0.88    0.12       O     0.14    0.86
H     0.50    0.50

Figure 16 Results of applying the fuzzy k-means clustering algorithm to the data from
Table 9. The values in parentheses indicate the membership function for each
object relative to group A


$$\mu_{2A} = \frac{1/31.04}{(1/5.08) + (1/31.04)} = 0.14 \qquad (39)$$

The sum ($\mu_{1A} + \mu_{2A}$) is unity, which satisfies Equation (32), and the
membership functions for the other objects can be calculated in a similar manner.
The process is repeated and after five iterations the total change in the squared
$\mu_{ki}$ values is less than $10^{-5}$ and the membership functions are
considered stable, Table 11. This result, Figure 16, accurately reflects the
symmetric distribution of the data.
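
The complete iteration can be sketched in Python as shown below. This is an illustrative sketch under stated assumptions, not the program used to produce Table 11: the names `DATA`, `NAMES`, and `fuzzy_kmeans` are chosen here, the memberships are squared as in Equation (33), the handling of an object lying exactly on a cluster centre is an added safeguard, and a tighter stopping tolerance than the $10^{-5}$ mentioned above is passed in the usage lines simply to be conservative.

```python
# Table 9: (x1, x2) for the 15 objects A..O.
DATA = [(1, 1), (1, 3), (1, 5), (2, 2), (2, 3), (2, 4), (3, 3), (4, 3),
        (5, 3), (6, 2), (6, 3), (6, 4), (7, 1), (7, 3), (7, 5)]
NAMES = "ABCDEFGHIJKLMNO"


def fuzzy_kmeans(data, mu, tol=1e-5, max_iter=100):
    """Iterate Equations (33) and (34).

    mu[k][i] is the membership of object i in cluster k. Returns the final
    memberships and cluster centres.
    """
    n_clusters, n_objects, n_vars = len(mu), len(data), len(data[0])
    centres = []
    for _ in range(max_iter):
        # Equation (33): centres are means weighted by squared memberships.
        centres = []
        for k in range(n_clusters):
            weights = [m * m for m in mu[k]]
            total = sum(weights)
            centres.append([sum(w * x[j] for w, x in zip(weights, data)) / total
                            for j in range(n_vars)])
        # Equation (34): memberships from inverse squared distances.
        new_mu = [[0.0] * n_objects for _ in range(n_clusters)]
        for i, x in enumerate(data):
            d2 = [sum((x[j] - c[j]) ** 2 for j in range(n_vars)) for c in centres]
            if min(d2) == 0.0:
                # Safeguard: an object exactly on a centre belongs wholly to it.
                hits = [k for k, d in enumerate(d2) if d == 0.0]
                for k in hits:
                    new_mu[k][i] = 1.0 / len(hits)
            else:
                inverse = [1.0 / d for d in d2]
                norm = sum(inverse)
                for k in range(n_clusters):
                    new_mu[k][i] = inverse[k] / norm
        # Total squared change in the memberships since the last iteration.
        change = sum((a - b) ** 2
                     for old, new in zip(mu, new_mu) for a, b in zip(old, new))
        mu = new_mu
        if change < tol:
            break
    return mu, centres


# Initial fuzzy partition of Table 10.
initial = [[0.9] * 8 + [0.1] * 7, [0.1] * 8 + [0.9] * 7]
mu, centres = fuzzy_kmeans(DATA, initial, tol=1e-12)
for name, m1, m2 in zip(NAMES, mu[0], mu[1]):
    print(f"{name}  {m1:.2f}  {m2:.2f}")
```

The printed memberships can be compared with Table 11; small differences in the second decimal place may arise from rounding in the hand calculation.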

The same algorithm that provides the membership functions for the test data
can be used to generate values for interpolated and extrapolated data, and a
three-dimensional surface plot produced, Figure 16(b). Going one stage
further, we can combine $\mu_{1i}$ and $\mu_{2i}$ to provide the complete
membership surface, according to the rule

This result is illustrated in Figure 17.

Fuzzy clustering can be applied to the test data examined previously, from
Table 7, and the results obtained when two, three, and four clusters are initially
specified are provided in Table 12. As with using the K-means algorithm,
although various parameters have been proposed to select the best number of
clusters, no single criterion is universally accepted. Although the results of
fuzzy clustering may appear appealing, we are still left with the need to make
some decision as to which cluster an object belongs. This may be achieved by
specifying some threshold membership value, α, in order to identify the core of
the cluster. Thus, if say α = 0.5, then from Figure 16 objects A, B, C, D, E, F, G
belong to Cluster 1, I, J, K, L, M, N, O can be assigned to Cluster 2, and object
H is an outlier from the two clusters.


Table 12 Membership function values for the objects from Table 7 assuming two,
three, and four clusters in the data


      Two clusters     Three clusters          Four clusters

      μ1i    μ2i       μ1i    μ2i    μ3i       μ1i    μ2i    μ3i    μ4i
A     0.09   0.99      0.06   0.06   0.88      0.03   0.91   0.03   0.03
B     0.83   0.17      0.87   0.08   0.05      0.83   0.05   0.09   0.02
C     0.89   0.1       1.0    0.0    0.0       1.0    0.0    0.0    0.0
D     0.89   0.1       0.90   0.07   0.03      0.9    0.02   0.0    0.01
E     0.04   0.96      0.03   0.00   0.94      0.00   0.90   0.02   0.05
F     0.07   0.93      0.06   0.07   0.87      0.05   0.84   0.05   0.05
G     0.01   0.99      0.01   0.02   0.97      0.03   0.82   0.04   0.11
H     0.99   0.01      0.37   0.58   0.05      0.26   0.04   0.67   0.03
I     0.89   0.11      0.01   0.98   0.01      0.02   0.01   0.96   0.01
J     0.94   0.06      0.08   0.90   0.02      0.03   0.01   0.94   0.01
K     0.79   0.21      0.04   0.94   0.02      0.05   0.03   0.86   0.05
L     0.27   0.83      0.14   0.33   0.53      0.0    0.0    0.0    1.0

Figure 17 The two cluster surface plot of data from Table 9 using the fuzzy clustering
algorithm

Cluster analysis is justifiably a popular and common technique for exploratory
data analysis. Most commercial multivariate statistical software packages
offer several algorithms, along with a wide range of graphical display facilities 
to aid the user in identifying patterns in data. Having indicated that some 
pattern and structure may be present in our data, it is often necessary to 
examine the relative importance of the variables and determine how the clusters 
may be defined and separated. This is the primary function of supervised 
pattern recognition and is examined in Chapter 5. 


CHAPTER 5


Pattern Recognition II:
Supervised Learning


1 Introduction 


Generally, the term pattern recognition tends to refer to the ability to assign an 
object to one of several possible categories, according to the values of some 
measured parameters. In statistics and chemometrics, however, the term is 
often used in two specific areas. In Chapter 4, unsupervised pattern recognition, 
or cluster analysis, was introduced as an exploratory method for data analysis. 
Given a collection of objects, each of which is described by a set of measures 
defining its pattern vector, cluster analysis seeks to provide evidence of natural 
groupings or clusters of the objects in order to allow the presence of patterns in 
the data to be identified. The number of clusters, their populations, and their 
interpretation are somewhat subjectively assigned and are not known before 
the analysis is conducted. Supervised pattern recognition, the subject of this 
chapter, is very different, and is often referred to in the literature as classifi- 
cation or discriminant analysis. With supervised pattern recognition, the 
number of parent groups is known in advance and representative samples of 
each group are available. With this information, the problem facing the analyst 
is to assign an unclassified object to one of the parent groups. A simple example 
will serve to make this distinction between unsupervised and supervised pattern 
recognition clearer. 

Suppose we have determined the elemental composition of a large number of 
mineral samples, and wish to know whether these samples can be organized 
into groups according to similarity of composition. As demonstrated in
Chapter 4, cluster analysis can be applied and a wide variety of methods are 
available to explore possible structures and similarities in the analytical data. 
The result of cluster analysis may be that the samples can be clearly distin- 
guished, by some combination of analyte concentrations, into two groups, and 
we may wish to use this information to identify and categorize future samples as 
belonging to one of the two groups. This latter process is classification, and the 
means of deriving the classification rules from previously classified samples 
is referred to as discrimination. It is a pre-requisite for undertaking this
supervised pattern recognition that a suitable collection of pre-assigned objects,
the training set, is available in order to determine the discriminating rule or 
discriminant function. 

The precise nature and form of the classifying function used in a pattern 
recognition exercise is largely dependent on the analytical data. If the parent 
population distribution of each group is known to follow the normal curve, 
then parametric methods such as statistical discriminant analysis can be usefully 
employed. Discriminant analysis is one of the most powerful and commonly 
used pattern recognition techniques and algorithms are generally available with 
all commercial statistical software packages. If, on the other hand, the distri- 
bution of the data is unknown, or known not to be normal, then non-parametric 
methods come to the fore. One of the most widely used non-parametric algo- 
rithms is that of K-nearest neighbours.* Finally, in recent years, considerable 
interest has been shown in the use of artificial neural networks for supervised 
pattern recognition and many examples have been reported in the analytical 
chemistry literature.⁵ In this chapter each of these techniques is examined along
with its application to analytical data. 


2 Discriminant Functions 


The most popular and widely used parametric method for pattern recognition is 
discriminant analysis. The background to the development and use of this 
technique will be illustrated using a simple bivariate example. 

In monitoring a chemical process, it was found that the quality of the final 
product can be assessed from spectral data using a simple two-wavelength 
photometer. Table 1 shows absorbance data recorded at these two wavelengths 
(400 and 560 nm) from samples of ‘good’ and ‘bad’ products, labelled Group A 
and Group B respectively. On the basis of the data presented, we wish to derive 
a rule to predict which group future samples can be assigned to, using the two 
wavelength measures. 

Examining the analytical data, the first step is to determine their descriptive 
statistics, i.e. the mean and standard deviation for each variable in Group A 
and Group B. It is evident from Table 1 that at both wavelengths Group A 
exhibits higher mean absorbance than samples from Group B. In addition, the 
standard deviation of data from each variable in both groups is similar. If we 
consider just one variable, the absorbance at 400 nm, then a first attempt at 
classification would assign the samples to groups according to this absorbance 
value. Figure 1 illustrates the predicted effect of such a scheme. The mean 


1 В.К. Lavine, in ‘Practical Guide to Chemometrics’, ed. S.J. Haswell, Marcel Dekker, New York, 
2 M iis "Classification Algorithms', Collins, London, UK, 1985. 

3 FJ. Manly, ‘Multivariate Statistical Methods: A Primer’, Chapman and Hall, London, UK, 
4 ware and V. Clark, ‘Computer-Aided Multivariate Analysis’, Lifetime Learning, California, 
5 " енді J. Gesteiger, "Neural Networks for Chemists', VCH, Weinheim, Germany, 1993. 
Table 1 Absorbance measurements on two classes of material at 400 and 560 nm

Good material (Group A)          Bad material (Group B)
Sample   400 nm   560 nm         Sample   400 nm   560 nm
1        0.40     0.60           12       0.20     0.50
2        0.45     0.45           13       0.20     0.40
3        0.50     0.60           14       0.20     0.30
4        0.50     0.70           15       0.25     0.40
5        0.55     0.65           16       0.25     0.25
6        0.60     0.50           17       0.30     0.30
7        0.60     0.60           18       0.35     0.35
8        0.60     0.70           19       0.40     0.30
9        0.65     0.80           20       0.40     0.20
10       0.70     0.60           21       0.50     0.10
11       0.70     0.80
Mean     0.568    0.636                   0.305    0.310
s        0.098    0.109                   0.104    0.112


Figure 1 The distribution of samples from Table 1 according to absorbance measurements
at one single wavelength, 400 nm, with the discriminant line marked


values and distribution of the sample absorbances at 400 nm are taken from 
Table 1, and it is clear that the use of this single variable alone is insufficient to 
separate the two groups. With the single variable, however, a decision or 
discriminant function can be proposed. 

For equal variances of absorbance data in Groups A and B, the discriminant
rule is given by

assign sample to Group A if

$$\mathrm{Absorbance}_{400\,\mathrm{nm}} > (\bar{x}_A + \bar{x}_B)/2 \tag{1}$$

and assign to Group B if

$$\mathrm{Absorbance}_{400\,\mathrm{nm}} < (\bar{x}_A + \bar{x}_B)/2 \tag{2}$$


i.e. a sample is assigned to the group with the nearest mean value. 

Having obtained such a classification rule, it is necessary to test the rule and 
indicate how good it is. There are several testing methods in common use. 
Procedures include the use of a set of independent samples or objects not 
included in the training set, the use of the training set itself, and the leave-one-out
method. The use of a new, independent set of samples not used in deriving
the classification rule may appear the obvious best choice, but it is often not 
practical. Given a finite size of a data set, such as in Table 1, it would be 
necessary to split the data into two sets, one for training and one for validation. 
The problem is deciding which objects should be in which set, and deciding on 
the size of the sets. Obviously, the more samples used to train and develop the 
classification rule, the more robust and better the rule is likely to be. Similarly, 
however, the larger the validation set, the more confidence we can have in the 
rule's ability to discriminate objects correctly. 

The most common method employed to get around this problem is to use all 
the available data for training the classifier and subsequently test each object as 
if it were an unknown, unclassified sample. The inherent problem with using the 
training set as the validation set is that the total classification error, the error 
rate, will be biased low. This is not surprising as the classification rule would 
have been developed using this same data. New, independent samples may lie 
outside the boundaries defined by the training set and we do not know how the 
rule will behave in such cases. This bias decreases as the number of samples 
analysed increases. For large data sets, say when the number of objects exceeds 
10 times the number of variables, the measured apparent error can be con- 
sidered a good approximation of the true error. 

If the independent sample set method is considered to be too wasteful of 
data, which may be expensive to obtain, and the use of the training set for 
validation is considered insufficiently rigorous, then the leave-one-out method 
can be employed. By this method all samples but one are used to derive the 
classification rule, and the sample left out is used to test the rule. The process is 
repeated with each sample in turn being omitted from the training set and used 
for validation. The major disadvantage of this method is that there are as many 
rules derived as there are samples in the data set and this can be compu- 
tationally demanding. In addition, the error rate obtained refers to the average 
performance of all the classifiers and not to any particular rule which may 
subsequently be applied to new, unknown samples. 
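
The leave-one-out procedure is easily sketched in code. The Python fragment below is an illustration only and not part of the original treatment: it applies the procedure to the 400 nm absorbances of Table 1, re-deriving the midpoint rule of Equations (1) and (2) each time a sample is held out. The function names are hypothetical, and a sample lying exactly on the threshold is sent to Group B here for simplicity.

```python
# Absorbances at 400 nm from Table 1.
GROUP_A = [0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.60, 0.60, 0.65, 0.70, 0.70]
GROUP_B = [0.20, 0.20, 0.20, 0.25, 0.25, 0.30, 0.35, 0.40, 0.40, 0.50]


def midpoint_threshold(group_a, group_b):
    """Discriminant value of Equations (1) and (2): midpoint of the group means."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return (mean_a + mean_b) / 2


def leave_one_out_errors(group_a, group_b):
    """Count misclassifications when each sample in turn is held out."""
    errors = 0
    for i, value in enumerate(group_a):
        training_a = group_a[:i] + group_a[i + 1:]
        threshold = midpoint_threshold(training_a, group_b)
        if not value > threshold:  # held-out Group A sample assigned to B
            errors += 1
    for i, value in enumerate(group_b):
        training_b = group_b[:i] + group_b[i + 1:]
        threshold = midpoint_threshold(group_a, training_b)
        if value > threshold:  # held-out Group B sample assigned to A
            errors += 1
    return errors


loo_errors = leave_one_out_errors(GROUP_A, GROUP_B)
total = len(GROUP_A) + len(GROUP_B)
print(f"leave-one-out error rate = {loo_errors}/{total}")
```

For these data two of the 21 held-out samples are misclassified. One rule is derived per sample, which is the computational cost referred to above.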

The results of classification techniques examined in this chapter will be 
assessed by their apparent error rates using all available data for both training 
and validation, in line with most commercial software. 
Table 2 Use of the contingency table, or confusion matrix, of classification
results (a). E_ij is the number of objects from group i classified as j, M_ia is
the number of objects actually in group i, and M_ic is the number classified
in group i. The results using the single absorbance at 400 nm, (b), and at
560 nm, (c)

(a)                     Actual membership
                        A        B
Predicted       A       E_AA     E_BA     M_Ac
membership      B       E_AB     E_BB     M_Bc
                        M_Aa     M_Ba

(b)                     Actual membership
                        A        B
Predicted       A       10       1        11
membership      B       1        9        10
                        11       10

(c)                     Actual membership
                        A        B
Predicted       A       9        1        10
membership      B       2        9        11
                        11       10


The rules expressed by Equations (1) and (2) ensure that the probability of 
error in misclassifying samples is equal for both groups. In those cases for 
which the absorbance lies on the discriminant line, samples are assigned 
randomly to Group A or B. Applying this classification rule to our data results 
in a total error rate of 9.5%; two of the 21 samples are misclassified. To detail how the
classifier makes errors, the results can be displayed in the form of a contingency 
table, referred to as a confusion matrix, of actual group against classified group, 
Table 2. A similar result is obtained if the single variable of absorbance at
560 nm is considered alone; three samples are misclassified.
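
The single-wavelength rule of Equations (1) and (2) and its confusion matrix can be reproduced with a few lines of Python. This is a minimal sketch, not the author's own procedure: the function names are hypothetical, and a sample falling exactly on the discriminant value is sent to Group B rather than assigned at random (no sample in Table 1 does so at 400 nm).

```python
# Absorbances at 400 nm from Table 1.
GROUP_A = [0.40, 0.45, 0.50, 0.50, 0.55, 0.60, 0.60, 0.60, 0.65, 0.70, 0.70]
GROUP_B = [0.20, 0.20, 0.20, 0.25, 0.25, 0.30, 0.35, 0.40, 0.40, 0.50]


def midpoint_threshold(group_a, group_b):
    """Discriminant value of Equations (1) and (2): midpoint of the group means."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return (mean_a + mean_b) / 2


def classify(absorbance, threshold):
    """Assign to Group A above the threshold, otherwise to Group B."""
    return "A" if absorbance > threshold else "B"


def confusion_matrix(group_a, group_b, threshold):
    """Return counts keyed by (actual group, predicted group)."""
    counts = {("A", "A"): 0, ("A", "B"): 0, ("B", "A"): 0, ("B", "B"): 0}
    for actual, values in (("A", group_a), ("B", group_b)):
        for value in values:
            counts[(actual, classify(value, threshold))] += 1
    return counts


threshold = midpoint_threshold(GROUP_A, GROUP_B)
matrix = confusion_matrix(GROUP_A, GROUP_B, threshold)
errors = matrix[("A", "B")] + matrix[("B", "A")]
print(f"threshold = {threshold:.3f}")
print(matrix)
print(f"apparent error rate = {errors}/{len(GROUP_A) + len(GROUP_B)}")
```

The counts correspond to part (b) of Table 2: one sample from each group falls on the wrong side of the discriminant value.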

In Figure 2, the distribution of each variable for each group is plotted along 
with a bivariate scatter plot of the data and it is clear that the two groups form 
distinct clusters. However, it is equally evident that it is necessary for both 
variables to be considered in order to achieve a clear separation. The problem 
facing us is to determine the best line between the data clusters, the discriminant 
function, and this can be achieved by consideration of probability and Bayes’ 
theorem. 


Bayes’ Theorem 


Figure 2 The data from Table 1 as a scatter plot and, along each axis, the univariate
distributions. Two distinct groups are evident from the data

Bayes’ rule simply states that ‘a sample or object should be assigned to that
group having the highest conditional probability’ and application of this rule to
parametric classification schemes provides optimum discriminating capability.
An explanation of the term ‘conditional probability’ is perhaps in order here,
with reference to a simple example. In spinning a fair coin, the chance of tossing
the coin and getting heads is 50%, i.e.


$$P(\text{heads}) = 0.5 \tag{3}$$


Similarly, the probability of tossing two coins resulting in both showing 
heads is given by 


$$P(\text{both heads}) = P_1(\text{heads}) \cdot P_2(\text{heads}) = 0.25 \tag{4}$$


If, however, one coin is already showing heads, then the conditional prob- 
ability of spinning the other coin and both showing heads is now 


$$P(\text{both heads} \mid \text{one head}) = 0.5 \tag{5}$$

which is to be read as ‘the probability of both coins being heads given that
one coin is heads is 0.5’, i.e. the probability of an event is modified, for better or 
worse, by prior knowledge. 

Of course, if one coin displays tails then, 


$$P(\text{both heads} \mid \text{one tail}) = 0.0 \tag{6}$$
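
These values can be confirmed by simply counting outcomes. The short Python sketch below is illustrative only; it assumes that the coin whose face is already known is a particular coin (here the first), and the helper names are hypothetical.

```python
from fractions import Fraction
from itertools import product

# The four equally likely outcomes of tossing two fair coins.
OUTCOMES = list(product("HT", repeat=2))


def probability(event, given=None):
    """P(event | given), obtained by counting equally likely outcomes."""
    conditioning = [o for o in OUTCOMES if given is None or given(o)]
    favourable = [o for o in conditioning if event(o)]
    return Fraction(len(favourable), len(conditioning))


def both_heads(outcome):
    return outcome == ("H", "H")


def first_is_heads(outcome):
    return outcome[0] == "H"


def first_is_tails(outcome):
    return outcome[0] == "T"


p_both = probability(both_heads)                             # Equation (4)
p_both_given_head = probability(both_heads, first_is_heads)  # Equation (5)
p_both_given_tail = probability(both_heads, first_is_tails)  # Equation (6)
print(p_both, p_both_given_head, p_both_given_tail)
```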


Returning to our analytical problem, of the 21 samples analysed and listed in 
Table 1, over 50% (11 of the 21) are known to belong to Group A. Thus, in the 
absence of any analytical data it would seem reasonable to assign any unknown 
sample to Group A as this has the higher probability of occurrence. With the
analytical data presented in Table 1, however, the probability of any sample
belonging to one of the groups will be modified according to its absorbance
values at 400 and 560 nm. The absorbance values comprise the pattern vector,
denoted by $\mathbf{x}$, where for each sample $x_1$ is the absorbance at 400 nm
and $x_2$ is the absorbance at 560 nm.

Expressed mathematically, therefore, and applying Bayes' rule, a sample is 
assigned to Group A, G(A), on the condition that, 


$$P(G(A) \mid \mathbf{x}) > P(G(B) \mid \mathbf{x}) \tag{7}$$


Unfortunately, to determine these conditional probability values, i.e. confirm 
that a particular group is characterized by a specific set of variate values, 
involves the analysis of all potential samples in the parent population. This is 
obviously unrealistic in practice, and it is necessary to apply Bayes’ theorem 
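which provides an indirect means of estimating the conditional probability,
$P(G(A) \mid \mathbf{x})$.

According to Bayes’ theorem,

$$P(G(A) \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid G(A))\, P(G(A))}{P(\mathbf{x} \mid G(A))\, P(G(A)) + P(\mathbf{x} \mid G(B))\, P(G(B))} \tag{8}$$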


$P(G(A))$ and $P(G(B))$ are the a priori probabilities, i.e. the probabilities of a
sample belonging to A and B in the absence of any analytical data.

$P(\mathbf{x} \mid G(A))$ is a conditional probability expressing the chance of a pattern
vector $\mathbf{x}$ arising from a member of Group A, and this can be estimated by sampling the
population of Group A. A similar equation can be arranged for $P(G(B) \mid \mathbf{x})$, and
substitution of Equation (8) into Equation (7) gives


assign sample pattern to Group A if 


$$P(\mathbf{x} \mid G(A))\, P(G(A)) > P(\mathbf{x} \mid G(B))\, P(G(B)) \tag{9}$$


The denominator term of Equation (8) is common to $P(G(A) \mid \mathbf{x})$ and $P(G(B) \mid \mathbf{x})$ and
hence cancels from each side of the inequality.

Although $P(\mathbf{x} \mid G(A))$ can be estimated by analysing large numbers of samples, and
similarly for $P(\mathbf{x} \mid G(B))$, the procedure is still time consuming and requires large
numbers of analyses. Fortunately, if the variables contributing to the vector 
pattern are assumed to possess a multivariate normal distribution, then these 
conditional probability values can be calculated from 


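$$P(\mathbf{x} \mid G(A)) = \frac{1}{2\pi\,|\mathrm{Cov}_A|^{1/2}} \exp\left[-\tfrac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_A)^T\, \mathrm{Cov}_A^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_A)\right] \tag{10}$$


which describes the multidimensional normal distribution for two variables (see
Chapter 1). $P(\mathbf{x} \mid G(A))$ can, therefore, be estimated from the vector of Group A
mean values, $\boldsymbol{\mu}_A$, and the group covariance matrix, $\mathrm{Cov}_A$.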
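
To see Equations (8) and (10) working together, the sketch below estimates the probability that the first sample of Table 1 belongs to Group A. It is an illustration under stated assumptions rather than a prescribed method: the prior probabilities are taken as the group proportions in Table 1 (11/21 and 10/21), the covariance matrices are the usual sample estimates with n - 1 in the denominator, and all function names are hypothetical.

```python
import math

# (400 nm, 560 nm) absorbances from Table 1.
GROUP_A = [(0.40, 0.60), (0.45, 0.45), (0.50, 0.60), (0.50, 0.70), (0.55, 0.65),
           (0.60, 0.50), (0.60, 0.60), (0.60, 0.70), (0.65, 0.80), (0.70, 0.60),
           (0.70, 0.80)]
GROUP_B = [(0.20, 0.50), (0.20, 0.40), (0.20, 0.30), (0.25, 0.40), (0.25, 0.25),
           (0.30, 0.30), (0.35, 0.35), (0.40, 0.30), (0.40, 0.20), (0.50, 0.10)]


def mean_and_covariance(samples):
    """Group mean vector and the elements (s11, s12, s22) of its covariance matrix."""
    n = len(samples)
    m1 = sum(x1 for x1, _ in samples) / n
    m2 = sum(x2 for _, x2 in samples) / n
    s11 = sum((x1 - m1) ** 2 for x1, _ in samples) / (n - 1)
    s22 = sum((x2 - m2) ** 2 for _, x2 in samples) / (n - 1)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def normal_density(x, mean, cov):
    """Bivariate normal density of Equation (10)."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    # (x - mu)^T Cov^-1 (x - mu), using the explicit inverse of a 2 x 2 matrix.
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return math.exp(-0.5 * quad) / (2.0 * math.pi * math.sqrt(det))


def posterior_a(x):
    """P(G(A) | x) from Bayes' theorem, Equation (8)."""
    total = len(GROUP_A) + len(GROUP_B)
    prior_a, prior_b = len(GROUP_A) / total, len(GROUP_B) / total
    like_a = normal_density(x, *mean_and_covariance(GROUP_A)) * prior_a
    like_b = normal_density(x, *mean_and_covariance(GROUP_B)) * prior_b
    return like_a / (like_a + like_b)


p_first = posterior_a(GROUP_A[0])     # sample 1, a Group A member
p_twelfth = posterior_a(GROUP_B[0])   # sample 12, a Group B member
print(f"P(G(A)|x) sample 1 = {p_first:.4f}, sample 12 = {p_twelfth:.4f}")
```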
Substituting Equation (10), and the equivalent for $P(\mathbf{x} \mid G(B))$, in Equation (9),
taking logarithms and rearranging leads to the rule


assign sample pattern to Group A if,

$$\ln P(G(A)) - 0.5\ln|\mathrm{Cov}_A| - 0.5(\mathbf{x} - \boldsymbol{\mu}_A)^T\, \mathrm{Cov}_A^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_A) > \ln P(G(B)) - 0.5\ln|\mathrm{Cov}_B| - 0.5(\mathbf{x} - \boldsymbol{\mu}_B)^T\, \mathrm{Cov}_B^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_B) \tag{11}$$


Calculation of the left-hand side of this equation results in a value for each
object which is a function of $\mathbf{x}$, the pattern vector, and which is referred to as
the discriminant score.

The discriminant function, $d_A(\mathbf{x})$, is defined by

$$d_A(\mathbf{x}) = 0.5\ln|\mathrm{Cov}_A| + 0.5(\mathbf{x} - \boldsymbol{\mu}_A)^T\, \mathrm{Cov}_A^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_A) \tag{12}$$

and substituting into Equation (11), the classification rule becomes

assign to Group A if,

$$\ln P(G(A)) - d_A(\mathbf{x}) > \ln P(G(B)) - d_B(\mathbf{x}) \tag{13}$$

If the prior probabilities can be assumed to be equal, i.e. $P(G(A)) = P(G(B))$,
then the dividing line between Groups A and B is given by

$$d_A(\mathbf{x}) = d_B(\mathbf{x}) \tag{14}$$

and Equation (13) becomes

assign object to Group A if,

$$-d_A(\mathbf{x}) > -d_B(\mathbf{x})$$

or

$$d_A(\mathbf{x}) < d_B(\mathbf{x}) \tag{15}$$


The second term in the right-hand side of Equation (12) defining the discrimi- 
nant function is the quadratic form of a matrix expansion. Its relevance to our 
discussions here can be seen with reference to Figure 3 which illustrates the 
division of the sample space for two groups using a simple quadratic function. 
This Bayes' classifier is able to separate groups with very differently shaped 
distributions, i.e. with differing covariance matrices, and it is commonly
referred to as the quadratic discriminant function.

The use of Equation (15) can be illustrated by application to the data from 
Table 1. 
Figure 3 Contour plots of two groups of bivariate normal data and the quadratic division
of the sample space (axes x1 and x2; the dividing curve is labelled ‘quadratic function’)


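From Table 1, the vector of variable means for each group is,

$$\boldsymbol{\mu}_A = \begin{bmatrix} 0.568 \\ 0.636 \end{bmatrix}, \qquad \boldsymbol{\mu}_B = \begin{bmatrix} 0.305 \\ 0.310 \end{bmatrix} \tag{16}$$

The variance-covariance matrix for each group can be determined by mean
centring the data, pre-multiplying this modified matrix by its transpose and
dividing by n - 1, or calculated according to Equation (24) of Chapter 1. Thus,

$$\mathrm{Cov}_A = \begin{bmatrix} 0.010 & 0.005 \\ 0.005 & 0.012 \end{bmatrix}, \qquad \mathrm{Cov}_B = \begin{bmatrix} 0.011 & -0.009 \\ -0.009 & 0.013 \end{bmatrix} \tag{17}$$

and their inverse matrices,

$$\mathrm{Cov}_A^{-1} = \begin{bmatrix} 136 & -60 \\ -60 & 109 \end{bmatrix}, \qquad \mathrm{Cov}_B^{-1} = \begin{bmatrix} 223 & 157 \\ 157 & 190 \end{bmatrix} \tag{18}$$

The determinant of each matrix is

$$|\mathrm{Cov}_A| = 8.8 \times 10^{-5}, \qquad |\mathrm{Cov}_B| = 5.7 \times 10^{-5} \tag{19}$$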


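The discriminant functions, $d_A(\mathbf{x})$ and $d_B(\mathbf{x})$, for each sample in the training
set of Table 1 can now be calculated.
Thus for the first sample,

$$\mathbf{x} = \begin{bmatrix} 0.400 \\ 0.600 \end{bmatrix}, \qquad \mathbf{x} - \boldsymbol{\mu}_A = \begin{bmatrix} -0.168 \\ -0.036 \end{bmatrix}, \qquad \mathbf{x} - \boldsymbol{\mu}_B = \begin{bmatrix} 0.095 \\ 0.290 \end{bmatrix}$$

$$d_A(\mathbf{x}) = 0.5\ln(8.8 \times 10^{-5}) + 0.5 \begin{bmatrix} -0.168 & -0.036 \end{bmatrix} \begin{bmatrix} 136 & -60 \\ -60 & 109 \end{bmatrix} \begin{bmatrix} -0.168 \\ -0.036 \end{bmatrix} = -3.03$$

$$d_B(\mathbf{x}) = 0.5\ln(5.7 \times 10^{-5}) + 0.5 \begin{bmatrix} 0.095 & 0.290 \end{bmatrix} \begin{bmatrix} 223 & 157 \\ 157 & 190 \end{bmatrix} \begin{bmatrix} 0.095 \\ 0.290 \end{bmatrix} = 8.44 \tag{20}$$


The calculated value for $d_A(\mathbf{x})$ is less than that of $d_B(\mathbf{x})$ so this object is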
assigned to Group A. The calculation can be repeated for each sample in the 
training set of Table 1, and the results are provided in Table 3. All 21 samples 
have been classified correctly as to their parent group. The quadratic dis- 
criminating function between the two groups can be derived from Equation (14) 
by solving the quadratic equations for x. The result is illustrated in Figure 4 and 
the success of this line in classifying the training set is apparent. 
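
The whole calculation is compact enough to sketch in Python. The fragment below is an illustration, not the author's own implementation: it assumes equal prior probabilities so that Equation (15) applies, estimates each covariance matrix with n - 1 in the denominator, and uses hypothetical function names. Because it works with unrounded matrices, the scores may differ from the tabulated values in the last digit.

```python
import math

# (400 nm, 560 nm) absorbances from Table 1.
GROUP_A = [(0.40, 0.60), (0.45, 0.45), (0.50, 0.60), (0.50, 0.70), (0.55, 0.65),
           (0.60, 0.50), (0.60, 0.60), (0.60, 0.70), (0.65, 0.80), (0.70, 0.60),
           (0.70, 0.80)]
GROUP_B = [(0.20, 0.50), (0.20, 0.40), (0.20, 0.30), (0.25, 0.40), (0.25, 0.25),
           (0.30, 0.30), (0.35, 0.35), (0.40, 0.30), (0.40, 0.20), (0.50, 0.10)]


def mean_and_covariance(samples):
    """Group mean vector and the elements (s11, s12, s22) of its covariance matrix."""
    n = len(samples)
    m1 = sum(x1 for x1, _ in samples) / n
    m2 = sum(x2 for _, x2 in samples) / n
    s11 = sum((x1 - m1) ** 2 for x1, _ in samples) / (n - 1)
    s22 = sum((x2 - m2) ** 2 for _, x2 in samples) / (n - 1)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples) / (n - 1)
    return (m1, m2), (s11, s12, s22)


def quadratic_score(x, mean, cov):
    """Discriminant function d(x) of Equation (12)."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 * s12
    d1, d2 = x[0] - mean[0], x[1] - mean[1]
    # (x - mu)^T Cov^-1 (x - mu), using the explicit inverse of a 2 x 2 matrix.
    quad = (s22 * d1 * d1 - 2.0 * s12 * d1 * d2 + s11 * d2 * d2) / det
    return 0.5 * math.log(det) + 0.5 * quad


mean_a, cov_a = mean_and_covariance(GROUP_A)
mean_b, cov_b = mean_and_covariance(GROUP_B)


def classify(x):
    """Equation (15): assign to the group with the smaller discriminant score."""
    d_a = quadratic_score(x, mean_a, cov_a)
    d_b = quadratic_score(x, mean_b, cov_b)
    return "A" if d_a < d_b else "B"


d_a_first = quadratic_score(GROUP_A[0], mean_a, cov_a)
d_b_first = quadratic_score(GROUP_A[0], mean_b, cov_b)
misclassified = ([x for x in GROUP_A if classify(x) != "A"]
                 + [x for x in GROUP_B if classify(x) != "B"])
print(f"sample 1: d_A = {d_a_first:.2f}, d_B = {d_b_first:.2f}")
print(f"misclassified samples: {len(misclassified)}")
```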


Table 3 Discriminant scores using the quadratic discriminant function as
classifier (a), and the resulting confusion matrix (b)

(a)
Sample   d_A      d_B      Assigned     Sample   d_A      d_B      Assigned
                           group                                   group
1        -3.03     8.44    A            12        2.60    -3.34    B
2        -3.13     2.51    A            13        2.43    -4.38    B
3        -4.43    16.23    A            14        3.36    -3.48    B
4        -3.87    25.6     A            15        0.80    -4.56    B
5        -4.62    25.88    A            16        3.04    -3.69    B
6        -3.32    17.05    A            17        1.03    -4.87    B
7        -4.46    26.25    A            18       -0.68    -4.23    B
8        -4.50    37.35    A            19        0.06    -4.02    B
9        -3.55    57.77    A            20        3.27    -4.38    B
10       -3.12    38.50    A            21        9.17    -2.90    B
11       -3.31    65.74    A

(b)                     Actual group
                        A        B
Predicted       A       11       0        11
membership      B       0        10       10
                        11       10


Linear Discriminant Function 


A further simplification can be made to the Bayes’ classifier if the covariance
matrices for both groups are known to be or assumed to be similar. This 
condition implies that the correlations between variables are independent of the 
group to which the objects belong. Extreme examples are illustrated in Figure 5. 
In such cases the groups are linearly separable and a linear discriminant 
function can be evaluated. 
Figure 4 Scatter plot of the data from Table 1 and the calculated quadratic discriminant
function

Figure 5 Contour plots of two groups of bivariate data with each group having identical
variance-covariance matrices (axes x1 and x2). Such groups are linearly separable
With the assumption of equal covariance matrices, the rule defined by
Equation (11) becomes

assign to Group A, if,

$$\ln P(G(A)) - 0.5(\mathbf{x} - \boldsymbol{\mu}_A)^T\, \mathrm{Cov}^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_A) > \ln P(G(B)) - 0.5(\mathbf{x} - \boldsymbol{\mu}_B)^T\, \mathrm{Cov}^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_B) \tag{21}$$

where $\mathrm{Cov} = \mathrm{Cov}_A = \mathrm{Cov}_B$.
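Once again, if the prior probabilities are equal, $P(G(A)) = P(G(B))$, the
classification rule is simplified,

assign to Group A, if,

$$-0.5(\mathbf{x} - \boldsymbol{\mu}_A)^T\, \mathrm{Cov}^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_A) > -0.5(\mathbf{x} - \boldsymbol{\mu}_B)^T\, \mathrm{Cov}^{-1}\, (\mathbf{x} - \boldsymbol{\mu}_B) \tag{22}$$

which by expanding out the matrix operations simplifies to

assign to Group A if,

$$(\boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}\, \mathbf{x}) - 0.5(\boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_A) > (\boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}\, \mathbf{x}) - 0.5(\boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_B) \tag{23}$$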


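Since $\boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}$ and $\boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_A$ are constants (they contain no x terms),
and similarly $\boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}$ and $\boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_B$, then we can define

$$C_{A1} = \boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}, \qquad C_{B1} = \boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}$$

$$C_{A0} = 0.5\,\boldsymbol{\mu}_A^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_A, \qquad C_{B0} = 0.5\,\boldsymbol{\mu}_B^T\, \mathrm{Cov}^{-1}\, \boldsymbol{\mu}_B \tag{24}$$

and

$$f_A(\mathbf{x}) = C_{A1}\,\mathbf{x} - C_{A0}$$

$$f_B(\mathbf{x}) = C_{B1}\,\mathbf{x} - C_{B0} \tag{25}$$

The classification rule is now, assign an object to Group A if,

$$f_A(\mathbf{x}) > f_B(\mathbf{x}) \tag{26}$$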


Equations (25) are linear with respect to x and this classification technique is 
referred to as linear discriminant analysis, with the discriminant function 
obtained by least squares analysis, analogous to multiple regression analysis. 

Turning to our spectroscopic data of Table 1, we can evaluate the perform- 
ance of this linear discriminant analyser. 

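For the whole set of data and combining all samples from both groups,

$$\mathrm{Cov} = \begin{bmatrix} 0.028 & 0.021 \\ 0.021 & 0.040 \end{bmatrix} \quad \text{and} \quad \mathrm{Cov}^{-1} = \begin{bmatrix} 60.33 & -32.14 \\ -32.14 & 42.36 \end{bmatrix}$$

$$C_{A0} = 0.5 \begin{bmatrix} 0.568 & 0.636 \end{bmatrix} \begin{bmatrix} 60.33 & -32.14 \\ -32.14 & 42.36 \end{bmatrix} \begin{bmatrix} 0.568 \\ 0.636 \end{bmatrix} = 6.689$$

$$C_{A1} = \begin{bmatrix} 0.568 & 0.636 \end{bmatrix} \begin{bmatrix} 60.33 & -32.14 \\ -32.14 & 42.36 \end{bmatrix} = \begin{bmatrix} 13.83 & 8.68 \end{bmatrix}$$

$$C_{B0} = 0.5 \begin{bmatrix} 0.305 & 0.310 \end{bmatrix} \begin{bmatrix} 60.33 & -32.14 \\ -32.14 & 42.36 \end{bmatrix} \begin{bmatrix} 0.305 \\ 0.310 \end{bmatrix} = 1.803$$

$$C_{B1} = \begin{bmatrix} 0.305 & 0.310 \end{bmatrix} \begin{bmatrix} 60.33 & -32.14 \\ -32.14 & 42.36 \end{bmatrix} = \begin{bmatrix} 8.44 & 3.33 \end{bmatrix} \tag{27}$$

Substituting into Equations (25), for the first sample,

$$f_A(\mathbf{x}) = \begin{bmatrix} 13.83 & 8.68 \end{bmatrix} \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix} - 6.689 = 4.052$$

$$f_B(\mathbf{x}) = \begin{bmatrix} 8.44 & 3.33 \end{bmatrix} \begin{bmatrix} 0.4 \\ 0.6 \end{bmatrix} - 1.803 = 3.57 \tag{28}$$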


Since the value for $f_A(\mathbf{x})$ exceeds that for $f_B(\mathbf{x})$, from Equation (26) the first
sample is assigned to Group A. The remaining samples can be analysed in a
similar manner and the results are shown in Table 4. One sample, from Group
A, is misclassified. The decision line can be found by solving for $\mathbf{x}$ when
$f_A(\mathbf{x}) = f_B(\mathbf{x})$. This line is shown in Figure 6 and the misclassified sample can be
clearly identified.
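
The linear scores can be generated for the whole training set with the following Python sketch. It is offered as an illustration only: in line with the worked example, the common matrix Cov is taken here to be the covariance matrix of the combined data (n - 1 denominator), equal priors are assumed, and the function names are hypothetical.

```python
# (400 nm, 560 nm) absorbances from Table 1.
GROUP_A = [(0.40, 0.60), (0.45, 0.45), (0.50, 0.60), (0.50, 0.70), (0.55, 0.65),
           (0.60, 0.50), (0.60, 0.60), (0.60, 0.70), (0.65, 0.80), (0.70, 0.60),
           (0.70, 0.80)]
GROUP_B = [(0.20, 0.50), (0.20, 0.40), (0.20, 0.30), (0.25, 0.40), (0.25, 0.25),
           (0.30, 0.30), (0.35, 0.35), (0.40, 0.30), (0.40, 0.20), (0.50, 0.10)]


def mean_vector(samples):
    n = len(samples)
    return (sum(x1 for x1, _ in samples) / n, sum(x2 for _, x2 in samples) / n)


def inverse_covariance(samples):
    """Elements (i11, i12, i22) of the inverse covariance matrix of the samples."""
    n = len(samples)
    m1, m2 = mean_vector(samples)
    s11 = sum((x1 - m1) ** 2 for x1, _ in samples) / (n - 1)
    s22 = sum((x2 - m2) ** 2 for _, x2 in samples) / (n - 1)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples) / (n - 1)
    det = s11 * s22 - s12 * s12
    return (s22 / det, -s12 / det, s11 / det)


def linear_coefficients(mean, inv):
    """C1 = mu^T Cov^-1 and C0 = 0.5 mu^T Cov^-1 mu, Equation (24)."""
    i11, i12, i22 = inv
    c1 = (mean[0] * i11 + mean[1] * i12, mean[0] * i12 + mean[1] * i22)
    c0 = 0.5 * (c1[0] * mean[0] + c1[1] * mean[1])
    return c1, c0


def linear_score(x, c1, c0):
    """f(x) = C1 x - C0, Equation (25)."""
    return c1[0] * x[0] + c1[1] * x[1] - c0


inv = inverse_covariance(GROUP_A + GROUP_B)
c_a1, c_a0 = linear_coefficients(mean_vector(GROUP_A), inv)
c_b1, c_b0 = linear_coefficients(mean_vector(GROUP_B), inv)

labelled = [(x, "A") for x in GROUP_A] + [(x, "B") for x in GROUP_B]
misclassified = []
for number, (x, actual) in enumerate(labelled, start=1):
    f_a, f_b = linear_score(x, c_a1, c_a0), linear_score(x, c_b1, c_b0)
    predicted = "A" if f_a > f_b else "B"   # Equation (26)
    if predicted != actual:
        misclassified.append(number)

f_a_first = linear_score(GROUP_A[0], c_a1, c_a0)
f_b_first = linear_score(GROUP_A[0], c_b1, c_b0)
print(f"sample 1: f_A = {f_a_first:.2f}, f_B = {f_b_first:.2f}")
print(f"misclassified sample numbers: {misclassified}")
```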
Figure 6 Scatter plot of the data from Table 1 and the calculated linear discriminant
function

Table 4 Discriminant scores using the linear discriminant function as classifier
(a), and the resulting confusion matrix (b)

(a)
Sample   f_A      f_B     Assigned     Sample   f_A      f_B     Assigned
                          group                                  group
1        4.05     3.57    A            12        0.42    1.55    B
2        3.44     3.49    B            13       -0.45    1.22    B
3        5.43     4.41    A            14       -1.32    0.88    B
4        6.30     4.75    A            15        0.24    1.64    B
5        6.54     5.00    A            16       -1.06    1.14    B
6        5.95     4.92    A            17        0.06    1.73    B
7        6.82     5.26    A            18        1.19    2.31    B
8        7.69     5.59    A            19        1.44    2.57    B
9        9.25     6.34    A            20        0.57    2.24    B
10       8.20     6.10    A            21        1.09    2.75    B
11       9.94     6.77    A

(b)                     Actual membership
                        A        B
Predicted       A       10       0        10
membership      B       1        10       11
                        11       10


As this linear classifier has performed less well than the quadratic classifier, it 
is worth examining further the underlying assumptions that are made in 
applying the linear model. The major assumption made is that the two groups 
of data arise from normal parent populations having similar covariance 
matrices. Visual examination of Figure 2 indicates that this assumption may 
not be valid for these absorbance data. The data from samples forming Group 
A display an apparent positive correlation (r = 0.54) between $x_1$ and $x_2$,
whereas there is negative correlation (r = -0.85) between the absorbance
values at the two wavelengths for those samples in Group B. For a more
quantitative measure and assessment of the similarity of the two
variance-covariance matrices we require some multivariate version of the simple F-test.
Such a test may be derived as follows.⁶

For k groups of data characterized by j = 1 ... m variables, we may compute
k variance-covariance matrices, and for two groups A and B we wish to test the
hypothesis

$$H_0: \mathrm{Cov}_A = \mathrm{Cov}_B$$

against the alternative,

$$H_1: \mathrm{Cov}_A \neq \mathrm{Cov}_B \tag{29}$$


6 J.C. Davis, ‘Statistics and Data Analysis in Geology’, J. Wiley & Sons, New York, USA, 1973.

If the data arise from a single parent population, then a pooled
variance-covariance matrix may be calculated from

$$\mathrm{Cov}_p = \frac{\sum_{i=1}^{k} (n_i - 1)\, \mathrm{Cov}_i}{\sum_{i=1}^{k} n_i - k} \tag{30}$$

where $n_i$ is the number of objects or samples in group i.
From Equation (30) a statistic, M, can be determined,

$$M = \left(\sum_{i=1}^{k} n_i - k\right) \ln|\mathrm{Cov}_p| - \sum_{i=1}^{k} (n_i - 1) \ln|\mathrm{Cov}_i| \tag{31}$$


which expresses the difference between the logarithm of the determinant of the 
pooled variance-covariance matrix and the average of the logarithms of the 
determinants of the group variance-covariance matrices. The more similar 
the group matrices, the smaller the value of M. 

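Finally a test statistic based on the $\chi^2$ distribution is generated from

$$\chi^2 = M \cdot C \tag{32}$$

where

$$C = 1 - \frac{2m^2 + 3m - 1}{6(m + 1)(k - 1)} \left( \sum_{i=1}^{k} \frac{1}{n_i - 1} - \frac{1}{\sum_{i=1}^{k} n_i - k} \right) \tag{33}$$

For small values of k and m, Davis reports that the $\chi^2$ approximation is
good, and for our two-group, bivariate sample data the calculation of the $\chi^2$
value is trivial.

$$C = 1 - \frac{2 \cdot 2^2 + 3 \cdot 2 - 1}{6(2 + 1)(2 - 1)} \left( \frac{1}{10} + \frac{1}{9} - \frac{1}{21 - 2} \right) = 0.885 \tag{34}$$

and

$$M = (21 - 2)\ln|\mathrm{Cov}_p| - 10\ln|\mathrm{Cov}_A| - 9\ln|\mathrm{Cov}_B| = 42.1 \tag{35}$$

Thus,

$$\chi^2 = 0.885 \times 42.1 = 37.3 \tag{36}$$

with the degrees of freedom given by

$$\nu = (1/2)(k - 1)(m)(m + 1) = 3 \tag{37}$$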
At a 5% level of significance, the critical value for $\chi^2$ from tables is 7.8. Our
value of 37.3 far exceeds this critical value, and the null hypothesis is rejected.
We may assume, therefore, that the two groups of samples are unlikely to have 
similar parent populations and, hence, similar variance-covariance matrices. It 
is not surprising, therefore, that the linear discriminant analysis model was 
inferior to the quadratic scheme in classification. 
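
Equations (30) to (33) can be evaluated directly from Table 1, as in the Python sketch below. This is an illustrative calculation with hypothetical function names, not a validated statistical routine. On these data the quoted M of 42.1 appears to be reproduced when the covariance matrix of the combined data (the matrix used in Equation (27)) stands in for the reference matrix, whereas the pooled within-group matrix of Equation (30) gives a smaller M of about 10.4 and a χ² of about 9.2. Either value exceeds the critical value of 7.8, so the conclusion is unchanged.

```python
import math

# (400 nm, 560 nm) absorbances from Table 1.
GROUP_A = [(0.40, 0.60), (0.45, 0.45), (0.50, 0.60), (0.50, 0.70), (0.55, 0.65),
           (0.60, 0.50), (0.60, 0.60), (0.60, 0.70), (0.65, 0.80), (0.70, 0.60),
           (0.70, 0.80)]
GROUP_B = [(0.20, 0.50), (0.20, 0.40), (0.20, 0.30), (0.25, 0.40), (0.25, 0.25),
           (0.30, 0.30), (0.35, 0.35), (0.40, 0.30), (0.40, 0.20), (0.50, 0.10)]


def covariance(samples):
    """Elements (s11, s12, s22) of the sample covariance matrix."""
    n = len(samples)
    m1 = sum(x1 for x1, _ in samples) / n
    m2 = sum(x2 for _, x2 in samples) / n
    s11 = sum((x1 - m1) ** 2 for x1, _ in samples) / (n - 1)
    s22 = sum((x2 - m2) ** 2 for _, x2 in samples) / (n - 1)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in samples) / (n - 1)
    return (s11, s12, s22)


def determinant(cov):
    s11, s12, s22 = cov
    return s11 * s22 - s12 * s12


def pooled_covariance(groups):
    """Equation (30): within-group matrices weighted by (n_i - 1)."""
    dof = sum(len(g) for g in groups) - len(groups)
    weighted = [tuple((len(g) - 1) * c for c in covariance(g)) for g in groups]
    return tuple(sum(column) / dof for column in zip(*weighted))


def m_statistic(groups, reference):
    """Equation (31), with `reference` playing the part of Cov_p."""
    dof = sum(len(g) for g in groups) - len(groups)
    group_term = sum((len(g) - 1) * math.log(determinant(covariance(g)))
                     for g in groups)
    return dof * math.log(determinant(reference)) - group_term


def correction(groups, m):
    """Equation (33) for k groups and m variables."""
    k = len(groups)
    dof = sum(len(g) for g in groups) - k
    bracket = sum(1.0 / (len(g) - 1) for g in groups) - 1.0 / dof
    return 1.0 - (2 * m * m + 3 * m - 1) / (6.0 * (m + 1) * (k - 1)) * bracket


groups = [GROUP_A, GROUP_B]
c_factor = correction(groups, m=2)
m_pooled = m_statistic(groups, pooled_covariance(groups))
m_combined = m_statistic(groups, covariance(GROUP_A + GROUP_B))
print(f"C = {c_factor:.3f}")
print(f"pooled matrix:   M = {m_pooled:.2f}, chi2 = {m_pooled * c_factor:.2f}")
print(f"combined matrix: M = {m_combined:.2f}, chi2 = {m_combined * c_factor:.2f}")
```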

The linear discriminant function is a most commonly used classification 
technique and it is available with all the most popular statistical software 
packages. It should be borne in mind, however, that it is only a simplification of 
the Bayes' classifier and assumes that the variates are obtained from a multi- 
variate normal distribution and that the groups have similar covariance 
matrices. If these conditions do not hold then the linear discriminant function 
should be used with care and the results obtained subject to careful analysis. 

Linear discriminant analysis is closely related to multiple regression analysis. 
Whereas in multiple regression, the dependent variable is assumed to be a 
continuous function of the independent variables, in discriminant analysis the 
dependent variable, e.g. Group A or Group B, is nominal and discrete. Given
this similarity, it is not surprising that the selection of appropriate variables to 
perform a discriminant analysis should follow a similar scheme to that 
employed in multiple regression (see Chapter 6). 

As with multiple regression analysis, the most commonly used selection 
procedures involve stepwise methods with the F-test being applied at each stage 
to provide a measure of the value of the variable to be added, or removed, in the 
discriminant function. The procedure is discussed in detail in Chapter 6. 

Finally it is worth noting that linear combinations of the original variables 
may provide better and more effective classification rules than the original 
variables themselves. Principal components are often employed in pattern 
recognition and are always worth examining. However, the interpretation of 
the classification rule in terms of relative importance of variables will generally 
be more confusing. 


3 Nearest Neighbours 


The discriminant analysis techniques discussed above rely for their effective use 
on a priori knowledge of the underlying parent distribution function of the 
variates. In analytical chemistry, the assumption of multivariate normal distri- 
bution may not be valid. A wide variety of techniques for pattern recognition 
not requiring any assumption regarding the distribution of the data have been 
proposed and employed in analytical spectroscopy. These methods are referred 
to as non-parametric methods. Most of these schemes are based on attempts to
estimate $P(\mathbf{x} \mid G(i))$ and include histogram techniques, kernel estimates and expansion
methods. One of the most common techniques is that of K-nearest neighbours.
The basic idea underlying nearest-neighbour methods is conceptually very 
simple, and in practice it is mathematically simple to implement. The general 
method is based on applying the so-called K-nearest neighbour classification 
rule, usually referred to as K-NN. The distance between the pattern vector of
the unclassified sample and every classified sample from the training set is
calculated, and the group holding the majority of the K smallest distances,
i.e. the nearest neighbours, determines to which group the unknown is to be
assigned. The most common distance metric used is the Euclidean distance
between two pattern vectors.

Figure 7 Radius, r, of the circle about an unclassified object containing three nearest
neighbours, two from group A and one from group B. The unknown sample is
assigned to group A

For objects 1 and 2 characterized by multivariate pattern vectors $\mathbf{x}_1$ and $\mathbf{x}_2$
defined by

$$\mathbf{x}_1 = (x_{11}, x_{12}, \ldots, x_{1m})$$

$$\mathbf{x}_2 = (x_{21}, x_{22}, \ldots, x_{2m}) \tag{38}$$

where m is the number of variables, the Euclidean distance between objects 1
and 2 is given by

$$d_{12} = \left[ \sum_{j=1}^{m} (x_{1j} - x_{2j})^2 \right]^{1/2} \tag{39}$$

Application of Equation (39) to the K-NN rule serves to define a sphere, or
circle for bivariate data, about the unclassified sample point in space, of radius
$r_K$, which is the distance to the Kth nearest neighbour, containing K nearest
neighbours, Figure 7. It is the volume of this sphere which is used to estimate
$P(\mathbf{x} \mid G(i))$.

For a total training set of N objects comprised of $n_i$ samples known to belong
to each group i, the procedure adopted is to determine the Kth nearest
neighbour to the unclassified object defined by its pattern vector $\mathbf{x}$, ignoring
group membership. From this, the conditional probability of the pattern vector
arising from the group i, $P(\mathbf{x} \mid G(i))$, is given by

$$P(\mathbf{x} \mid G(i)) = \frac{k_i}{n_i} \cdot \frac{1}{V_{K,\mathbf{x}}} \tag{40}$$

where $k_i$ is the number of nearest neighbours in group i and $V_{K,\mathbf{x}}$ is the volume
of space which contains the K nearest neighbours.

Using Equation (40) in the Bayes’ rule gives

assign to group i if

$$P(G(i))\, \frac{k_i}{n_i} \cdot \frac{1}{V_{K,\mathbf{x}}} > P(G(j))\, \frac{k_j}{n_j} \cdot \frac{1}{V_{K,\mathbf{x}}} \quad \text{for all } j \neq i \tag{41}$$

Since the volume term is common to both sides of the inequality, the rule
simplifies to,

assign to group i if

$$P(G(i))\, \frac{k_i}{n_i} > P(G(j))\, \frac{k_j}{n_j} \quad \text{for all } j \neq i \tag{42}$$

If the number of objects in each training set, $n_i$, is proportional to the
unconditional probability of occurrence of the groups, $P(G(i))$, then Equation (42)
simplifies further to

assign to group i if,

$$k_i > k_j \tag{43}$$


This is a common form of the nearest-neighbour classification rule and 
assigns a new, unclassified object to that group that contains the majority of its 
nearest neighbours. 

The choice of value for k is somewhat empirical and, for overlapping classes, 
k =3 or 5 have been proposed to provide good classification. In general, 
however, k = 1 is the most widely used case and is referred to as the 1-NN 
method or, simply, the nearest-neighbour method. 

For our bivariate, spectrophotometric data the inter-sample distance matrix, 
using the Euclidean metric, is given in Table 5. For each sample the nearest 
neighbour is highlighted. The confusion matrix summarizes the results, and 
once again, this time using the 1-NN rule, a single sample from Group A is
misclassified. 
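
The 1-NN rule is short enough to sketch in Python. The fragment below is illustrative only, with hypothetical function names: each sample of Table 1 is treated in turn as the unknown, its Euclidean distance to every other sample is calculated from Equation (39), and it is given the group of its closest neighbour.

```python
import math

# ((400 nm, 560 nm), group) for samples 1 to 21 of Table 1.
SAMPLES = [((0.40, 0.60), "A"), ((0.45, 0.45), "A"), ((0.50, 0.60), "A"),
           ((0.50, 0.70), "A"), ((0.55, 0.65), "A"), ((0.60, 0.50), "A"),
           ((0.60, 0.60), "A"), ((0.60, 0.70), "A"), ((0.65, 0.80), "A"),
           ((0.70, 0.60), "A"), ((0.70, 0.80), "A"),
           ((0.20, 0.50), "B"), ((0.20, 0.40), "B"), ((0.20, 0.30), "B"),
           ((0.25, 0.40), "B"), ((0.25, 0.25), "B"), ((0.30, 0.30), "B"),
           ((0.35, 0.35), "B"), ((0.40, 0.30), "B"), ((0.40, 0.20), "B"),
           ((0.50, 0.10), "B")]


def euclidean(p, q):
    """Euclidean distance between two bivariate pattern vectors, Equation (39)."""
    return math.hypot(p[0] - q[0], p[1] - q[1])


def nearest_neighbour_group(index, samples):
    """Group of the closest sample to samples[index], excluding the sample itself."""
    point = samples[index][0]
    candidates = [(euclidean(point, other), group)
                  for j, (other, group) in enumerate(samples) if j != index]
    return min(candidates)[1]


misclassified = [number
                 for number, (_, group) in enumerate(SAMPLES, start=1)
                 if nearest_neighbour_group(number - 1, SAMPLES) != group]
print(f"misclassified sample numbers: {misclassified}")
```

Sample 2 lies closer to sample 18 of Group B than to any other member of Group A, which accounts for the single error.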

Table 5 The Euclidean distance matrix for the materials according to the two
wavelengths measured (a), and the resulting confusion matrix after
applying the k-NN classification algorithm (b). Distances are multiplied by 100

(a)
Sample     1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20  21
1          0  16  10  14  16  22  20  22  32  30  36  22  28  36  25  38  32  26  30  40  51
2         16   0  16  26  22  16  21  29  40  29  43  26  26  29  21  28  21  14  16  26  35
3         10  16   0  10   7  14  10  14  25  20  28  32  36  42  32  43  36  29  32  41  50
4         14  26  10   0   7  22  14  10  18  22  22  36  42  50  39  51  45  38  41  51  60
5         16  22   7   7   0  16   7   7  18  16  21  38  43  50  39  50  43  36  38  47  55
6         22  16  14  22  16   0  10  20  30  14  32  40  41  45  36  43  36  29  28  36  41
7         20  21  10  14   7  10   0  10  21  10  22  41  45  50  40  50  42  35  36  45  51
8         22  29  14  10   7  20  10   0  11  14  14  45  50  57  46  57  50  43  45  54  61
9         32  40  25  18  18  30  21  11   0  21   5  54  60  67  57  68  61  54  56  65  72
10        30  29  20  22  16  14  10  14  21   0  20  51  54  57  49  57  50  43  42  50  54
11        36  43  28  22  21  32  22  14   5  20   0  58  64  71  60  71  64  57  58  67  73
12        22  26  32  36  38  40  41  45  54  51  58   0  10  20  11  26  22  21  28  36  50
13        28  26  36  42  43  41  45  50  60  54  64  10   0  10   5  16  14  16  22  28  42
14        36  29  42  50  50  45  50  57  67  57  71  20  10   0  11   7  10  16  20  22  36
15        25  21  32  39  39  36  40  46  57  49  60  11   5  11   0  15  11  11  18  25  39
16        38  28  43  51  50  43  50  57  68  57  71  26  16   7  15   0   7  14  16  16  29
17        32  21  36  45  43  36  42  50  61  50  64  22  14  10  11   7   0   7  10  14  28
18        26  14  29  38  36  29  35  43  54  43  57  21  16  16  11  14   7   0   7  16  29
19        30  16  32  41  38  28  36  45  56  42  58  28  22  20  18  16  10   7   0  10  22
20        40  26  41  51  47  36  45  54  65  50  67  36  28  22  25  16  14  16  10   0  14
21        51  35  50  60  55  41  51  61  72  54  73  50  42  36  39  29  28  29  22  14   0
Assigned   A   B   A   A   A   A   A   A   A   A   A   B   B   B   B   B   B   B   B   B   B
group

(b)                     Actual membership
                        A        B
Predicted       A       10       0        10
membership      B       1        10       11
                        11       10

As well as the Euclidean distance, other metrics have been proposed and
employed to measure similarities of pattern vectors between objects. One
method used for comparing and classifying spectroscopic data is the Hamming
distance. For two pattern vectors $\mathbf{x}_1$ and $\mathbf{x}_2$ defined by Equation (38), the
Hamming distance, H, is simply the sum of the absolute differences between each
element of one vector and the corresponding element of the other.

$$H = \sum_{j=1}^{m} |x_{1j} - x_{2j}| \tag{44}$$


When, say, infrared or mass spectra can be reduced to binary strings indicating
the presence or absence of peaks or other features, the Hamming distance
metric is simple to implement. In such cases it provides the number of differing bits
in the binary pattern and is equivalent to performing the exclusive-OR function
between the vectors and counting the bits set. The Hamming distance is a popular
choice in spectral database ‘look-up and compare’ algorithms for identifying
unknown spectra. Figure 8 provides a simple example of applying the method.

Spectrum   Binary code   XOR with sample
Sample     01011011
R1         10101011      11110000
R2         10010110      11001101
R3         10001011      11010000
R4         01010011      00001000
R5         00111010      01100001

Figure 8 Binary representation of spectral data (1 = peak, 0 = no peak). The sample has
the smallest number of XOR bits set with reference spectrum R4, and this, therefore,
is the best match
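
The look-up of Figure 8 can be sketched in a few lines of Python. The binary codes are those given in the figure; the function and variable names are hypothetical, and a practical spectral library search would of course involve far longer codes.

```python
# Binary peak codes taken from Figure 8.
SAMPLE = "01011011"
REFERENCES = {
    "R1": "10101011",
    "R2": "10010110",
    "R3": "10001011",
    "R4": "01010011",
    "R5": "00111010",
}


def hamming_distance(code_a, code_b):
    """Number of differing bits: exclusive-OR the codes and count the bits set."""
    return bin(int(code_a, 2) ^ int(code_b, 2)).count("1")


distances = {name: hamming_distance(SAMPLE, code)
             for name, code in REFERENCES.items()}
best_match = min(distances, key=distances.get)
print(distances)
print(f"best match: {best_match}")
```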

Despite its relative simplicity, the nearest-neighbour classification method 
often provides excellent results and has been widely used in analytical science. 
Another advantage of the K-NN technique is that it is a multi-category method. 
It does not require repeated application to assign some unknown sample to a 
class as is often the case with binary classifiers. Its major disadvantage is that it 
is computationally demanding. For each classification decision, the distance 
between the sample pattern vector and every object in the training set for all 
groups must be calculated and compared. Where very large training sets are 
used, however, each distinct class or group can be represented by a few 
representative patterns to provide an initial first-guess classification before 
every object in the best classes is examined. 


4 The Perceptron 


As an approximation to the Bayes’ rule, the linear discriminant function 
provides the basis for the most common of the statistical classification schemes, 
but there has been much work devoted to the development of simpler linear
classification rules. One such method which has featured extensively in spectro- 
scopic pattern recognition studies is the perceptron algorithm. 

The perceptron is a simple linear classifier that requires no assumptions to be 
made regarding the parent distribution of the analytical data. For pattern 
vectors that are linearly separable, a perceptron will find a hyperplane (in two 
dimensions this is a line) that completely separates the groups. The algorithm is 
iterative and starts by placing a line at random in the sample space and 
examining which side of the line each object in the training set falls. If an object 
is on the wrong side of the line then the position of the line is changed to 
attempt to correct the mistake. The next object is examined and the process 
repeats until a line position is found that correctly partitions the sample space 
for all objects. The method makes no claims regarding its ability to classify 
objects not included in the training set, and if the groups in the training set are 
not linearly separable then the algorithm may not settle to a final stable result. 

The perceptron is a learning algorithm and can be considered as a simple 
model of a biological neuron. It is worth examining here not only as a classifier 
in its own right, but also as providing the basic features of modern artificial 
neural networks. 

The operation of a perceptron unit is illustrated schematically in Figure 9. 
The function of the unit is to modify its input signals and produce a binary 
output, 1 or 0, dependent on the sum of these inputs. Mathematically, the 
perceptron performs a weighted sum of its inputs, compares this with some 
threshold value and turns its output on (output = 1) if this value is 
exceeded; otherwise the output remains off (output = 0). 

For m inputs, 


$$ \text{total input, } I = \sum_{i=1}^{m} w_i x_i = \mathbf{w}^{T}\mathbf{x} \qquad (45) $$


$\mathbf{x}^{T} = (x_1 \ldots x_m)$ represents an object's pattern vector, and 
$\mathbf{w}^{T} = (w_1 \ldots w_m)$ is the vector of weights which serve to 
modify the relative importance of each element of x. These weights are varied 
as the model learns to distinguish between the groups assigned in the training 
set. 


Figure 9 The simple perceptron unit. Inputs are weighted and summed and the output is 
‘1’ or ‘0’ depending on whether or not it exceeds a defined threshold value 

The sum of the inputs, $I$, is compared with a threshold value, $\theta$, and if 
$I > \theta$ a value of 1 is output, otherwise 0 is output, Figure 9. This 
comparison can be achieved by subtracting $\theta$ from $I$ and comparing the 
result with zero, i.e. by adding $-\theta$ as an offset to $I$. The summation and 
comparison operations can, therefore, be combined by modifying Equation (45), 


$$ \text{total input, } I = \sum_{i=1}^{m+1} w_i x_i = \mathbf{w}\cdot\mathbf{x} \qquad (46) $$


where now $\mathbf{w} = (w_1 \ldots w_{m+1})$ with $w_{m+1} = -\theta$ being 
referred to as the unit's bias, and $\mathbf{x} = (x_1 \ldots x_{m+1})$ with 
$x_{m+1} = 1$. 
The resulting output, $y$, is given by 


$$ y = f_H[\mathbf{w}\cdot\mathbf{x}] \qquad (47) $$

where $f_H$ is the Heaviside or step function defined by 


$$ f_H(x) = 1,\ x > 0 $$
$$ f_H(x) = 0,\ x \le 0 \qquad (48) $$
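
As a minimal sketch of Equations (46) to (48), the Python function below appends the constant input $x_{m+1} = 1$, forms the weighted sum, and applies the step function. The name `perceptron_output` and the example weights are hypothetical; a sum of exactly zero is treated as "off", following the "otherwise 0 is output" wording above.

```python
def perceptron_output(weights, pattern):
    """Binary output of a single perceptron unit.

    weights: (w_1, ..., w_m, w_{m+1}); the last element is the bias.
    pattern: (x_1, ..., x_m), the pattern vector without the constant input.
    """
    augmented = list(pattern) + [1.0]           # x_{m+1} = 1 carries the bias
    total_input = sum(w * x for w, x in zip(weights, augmented))
    return 1 if total_input > 0 else 0          # step function; zero counts as off


# A threshold of 0.5 on two equally weighted inputs becomes a bias of -0.5.
print(perceptron_output([1.0, 1.0, -0.5], [0.2, 0.1]))   # weighted sum is -0.2
print(perceptron_output([1.0, 1.0, -0.5], [0.4, 0.3]))   # weighted sum is +0.2
```

Folding the threshold into the weight vector means that the learning rule only ever has to adjust one vector, which is what the training steps below rely on.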

The training of the perceptron as a linear classifier then proceeds by the 
following steps, 


(a) randomly assign the initial elements of the weight vector, w, 

(b) present an input pattern vector from the training set, 

(c) calculate the output value according to Equation (47), 

(d) alter the weight vector to discourage incorrect decisions and reduce the 
classification error, 

(e) present the next object's pattern vector and repeat from step (c). 


This process is repeated until all objects are correctly classified. 

Figure 10(a) illustrates a bivariate data set comprising two groups, each of 
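two objects. These four objects are defined by their pattern vectors, including 
$x_{m+1}$, as 


$$
\begin{aligned}
\text{A1},\ \mathbf{x} &= [0.2 \quad 0.4 \quad 1.0] \\
\text{A2},\ \mathbf{x} &= [0.5 \quad 0.3 \quad 1.0] \\
\text{B1},\ \mathbf{x} &= [0.3 \quad 0.7 \quad 1.0] \\
\text{B2},\ \mathbf{x} &= [0.8 \quad 0.8 \quad 1.0]
\end{aligned} \qquad (49)
$$

and we take as our initial weight vector 

$$ \mathbf{w} = [1.0 \quad -1.0 \quad 0.5] \qquad (50) $$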


Thus, our initial partition line, Figure 10(b), is defined by 

$$ w_1 x_1 + w_2 x_2 + w_3 x_3 = 0 $$
$$ x_1 + 0.5 = x_2 \qquad (51) $$


Figure 10 A simple two-group, bivariate data set (a), and iterative discriminant analysis 
using the simple perceptron (b). Panel (b) marks the 1st line and the final line; 
both panels have x1 as the horizontal axis 


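For our first object, A1, the product of x and w is positive and the output is 1, 
which is a correct result. 


$$
\begin{aligned}
I_{A1} &= \mathbf{w}\cdot\mathbf{x} \\
       &= [1 \quad -1 \quad 0.5]\,[0.2 \quad 0.4 \quad 1.0] \\
       &= (0.2 - 0.4 + 0.5) \\
       &= 0.3
\end{aligned}
$$

and 

$$ y_{A1} = f_H(0.3) = 1 \qquad (52) $$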


For sample A2, the output is also positive and no change in the weight vector 
is required. For sample B1, however, an output of 1 is incorrect; B1 is not in the 
same group as Al and A2, and we need to modify the weight vector. The 
following weight vector adapting rule is simple to implement,⁷ 


(a) if the result is correct, then w(new) = w(old), 
(b) if y = 0 but should be y = 1, then w(new) = w(old) + x,        (53) 
(c) if y = 1 but should be y = 0, then w(new) = w(old) − x. 


7 R. Beale and T. Jackson, ‘Neural Computing: An Introduction’, Adam Hilger, Bristol, UK, 
1991. 

Our perceptron has failed on sample B1: the output is 1 but should be 0. 
Therefore, from Equation (53c), 


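$$
\begin{aligned}
\mathbf{w}(\text{new}) &= [1 \quad -1 \quad 0.5] - [0.3 \quad 0.7 \quad 1.0] \\
                       &= [0.7 \quad -1.7 \quad -0.5]
\end{aligned} \qquad (54)
$$

This new partition line is defined by 

$$ 0.7x_1 - 0.5 = 1.7x_2 $$

and is illustrated in Figure 10(b). 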


Sample B2 is now presented to the system; it is correctly classified with a zero 
output as $I_{B2}$ is negative. We can now return to sample A1 and continue to 
repeat the entire process until all samples are classified correctly. The full set of 
results is summarized in Table 6. The final weight vector is w = [0.3 −1.9 1.0] 
with the partition line being 


$$ 0.3x_1 + 1.0 = 1.9x_2 \qquad (55) $$


This is illustrated in Figure 10(b) and serves to provide the correct 
classification of the four objects. 


Table 6 Calculations and results by iteratively applying the perceptron rule to 
the data illustrated in Figure 10(b). The weight vector w is listed only on the 
rows where it has just been changed 


Sample  Correct   w                     w.x     Calculated  Result 
        sign                                    sign 

A1      +         1.0   -1.0    0.5     0.30    +           YES 
A2      +                               0.70    +           YES 
B1      -                               0.10    +           NO 
B2      -         0.7   -1.7   -0.5    -1.30    -           YES 
A1      +                              -1.04    -           NO 
A2      +         0.9   -1.3    0.5     0.56    +           YES 
B1      -                              -0.14    -           YES 
B2      -                               0.18    +           NO 
A1      +         0.1   -2.1   -0.5    -1.32    -           NO 
A2      +         0.3   -1.7    0.5     0.14    +           YES 
B1      -                              -0.60    -           YES 
B2      -                              -0.62    -           YES 
A1      +                              -0.12    -           NO 
A2      +         0.5   -1.3    1.5     1.36    +           YES 
B1      -                               0.74    +           NO 
B2      -         0.2   -2.0    0.5    -0.94    -           YES 
A1      +                              -0.26    -           NO 
A2      +         0.4   -1.6    1.0     0.72    +           YES 
B1      -                               0.00    ?           NO 
B2      -         0.1   -2.3    0      -1.76    -           YES 
A1      +                              -0.9     -           NO 
A2      +         0.3   -1.9    1.0     0.56    +           YES 
B1      -                              -0.24    -           YES 
B2      -                              -0.28    -           YES 
A1      +                               0.30    +           YES 
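
The iteration can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the program used for the book: the function name `train_perceptron` and the safety limit `max_passes` are hypothetical, objects are presented in the fixed order A1, A2, B1, B2, and training stops after the first complete pass without a correction. Carried through in floating-point arithmetic, the seventh correction (adding A1 to [0.2 −2.0 0.5]) gives a bias of 1.5 where Table 6 lists 1.0, so the sketch needs a few more corrections and stops at [0.1 −2.5 1.5]; like the vector of Equation (55), this classifies all four objects correctly.

```python
def train_perceptron(samples, weights, max_passes=100):
    """Apply rule (53) cyclically until every object is classified correctly.

    samples: list of (pattern, target); each pattern already ends with x_{m+1} = 1.
    weights: initial weight vector (not modified; a new list is returned).
    max_passes: hypothetical safety limit for data that are not linearly separable.
    """
    w = list(weights)
    for _ in range(max_passes):
        corrections = 0
        for x, target in samples:
            total = sum(wi * xi for wi, xi in zip(w, x))
            y = 1 if total > 0 else 0
            if y == 0 and target == 1:           # rule (53b): add the pattern
                w = [wi + xi for wi, xi in zip(w, x)]
                corrections += 1
            elif y == 1 and target == 0:         # rule (53c): subtract the pattern
                w = [wi - xi for wi, xi in zip(w, x)]
                corrections += 1
        if corrections == 0:                     # a full pass with no mistakes
            return w
    raise RuntimeError("no separating line found within max_passes")


# Equation (49): group A should give output 1, group B output 0.
data = [([0.2, 0.4, 1.0], 1), ([0.5, 0.3, 1.0], 1),
        ([0.3, 0.7, 1.0], 0), ([0.8, 0.8, 1.0], 0)]
w_final = train_perceptron(data, [1.0, -1.0, 0.5])    # Equation (50)
print([round(wi, 2) for wi in w_final])
```

The separating line is not unique, which is why two different correction paths can both end in an acceptable result.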

The calculations involved in implementing this perceptron algorithm are 
simple but tedious to perform manually. Using a simple computer program to 
analyse the two-wavelength spectral data from Table 1, a satisfactory partition 
line is eventually obtained and the result is illustrated in Figure 11. The 
perceptron has achieved a separation of the two groups and every sample has 
been assigned to its correct parent group. 

Several variations of this simple perceptron algorithm can be found in the 
literature, with most differences relating to the rules used for adapting the 
weight vector. A detailed account can be found in Beale and Jackson, as well as 
a proof of the perceptron's ability to produce a satisfactory solution, if such a 
solution is possible.⁷ 

The major limitation of the simple perceptron model is that it fails drastically 
on linearly inseparable pattern recognition problems. For a solution to these 
cases we must investigate the properties and abilities of multilayer perceptrons 
and artificial neural networks. 


5 Artificial Neural Networks 


The simple perceptron model attempts to find a straight line capable of 
separating pre-classified groups. If such a discriminating function is possible 
then it will, eventually, be found. Unfortunately, there are many classification 
problems which are less simple or less tractable. 


Figure 11 Partition of the data from Table 1 by a linear function derived from a simple 
perceptron unit 


Figure 12 A simple two-group, bivariate data set that is not linearly separable by a single 
function. The lines shown are the linear classifiers from the two units in the first 
layer of the multilayer system shown in Figure 13 

Consider, for example, the two-group, four-object data set illustrated in 
Figure 12. Despite the apparent simplicity of this data set, it is immediately 
apparent that no single straight line can be drawn that will isolate the two 
classes of sample points. To achieve class separation and develop a satisfactory 
pattern recognition scheme, it is necessary to modify the simple perceptron. 

Correct identification and classification of sets of linearly inseparable items 
requires two major changes to the simple perceptron model. Firstly, more than 
one perceptron unit must be used. Secondly, we need to modify the nature of 
the threshold function. One arrangement which can correctly solve our four- 
sample problem is illustrated in Figure 13. Each neuron in the first layer 
receives its inputs from the original data, applies the weight vector, thresholds 
the weighted sum and outputs the appropriate value of zero or one. These 
outputs serve as inputs to the second, output layer. 

Each perceptron unit in the first layer applies a linear decision function 
derived from the weight vectors, 


$$ \mathbf{w}_1 = [4 \quad -10 \quad 6] $$
$$ \mathbf{w}_2 = [-10 \quad 4 \quad 6] \qquad (55) $$


which serve to define the lines shown in Figure 12. The weight vector associated 
with the third, output, perceptron is designed to provide the final classification 
from the output values of perceptrons 1 and 2, 


$$ \mathbf{w}_3 = [1.5 \quad 1.5 \quad -2] \qquad (56) $$


Figure 13 A two-layer neural network to solve the discriminant problem illustrated in 
Figure 12. The weighting coefficients are shown adjacent to each connection 
and the threshold or bias for each neuron is given above each unit 


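We can calculate the output from each perceptron for each sample presented 
to the input of the system. Thus for object A1, 


at perceptron 1, 

$$ \mathbf{w}\cdot\mathbf{x} = [4 \quad -10 \quad 6]\,[0.2 \quad 0.8 \quad 1] = -1.2, \qquad y_{p1} = 0 \qquad (57) $$

at perceptron 2, 

$$ \mathbf{w}\cdot\mathbf{x} = [-10 \quad 4 \quad 6]\,[0.2 \quad 0.8 \quad 1] = 7.2, \qquad y_{p2} = 1 \qquad (58) $$

at perceptron 3, whose inputs are the two first-layer outputs and the constant 1, 

$$ \mathbf{w}\cdot\mathbf{x} = [1.5 \quad 1.5 \quad -2]\,[0 \quad 1 \quad 1] = -0.5, \qquad y_{p3} = 0 \qquad (59) $$

Similar calculations can be performed for A2, B1, and B2: 

        p1 output   p2 output   p3 output 
A1      0           1           0 
A2      1           0           0 
B1      1           1           1 
B2      1           1           1 


Perceptron 3 is performing an AND function on the output levels from 
perceptrons 1 and 2 since its output is 1 only when both inputs are 1. 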
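
The fixed two-layer arrangement can be checked with a short Python sketch using the weight vectors of Equations (55) and (56). Only the coordinates of A1 are given in the text; the coordinates used here for A2, B1 and B2 are hypothetical values chosen so that the three units give the outputs tabulated above, and the function names are illustrative only.

```python
def step(value):
    """Heaviside threshold: 1 for a positive weighted sum, otherwise 0."""
    return 1 if value > 0 else 0


def unit(weights, inputs):
    """One perceptron unit; the constant input 1 is appended for the bias."""
    return step(sum(w * x for w, x in zip(weights, list(inputs) + [1])))


W1 = [4, -10, 6]       # Equation (55), first-layer unit 1
W2 = [-10, 4, 6]       # Equation (55), first-layer unit 2
W3 = [1.5, 1.5, -2]    # Equation (56), output unit


def network(x1, x2):
    p1 = unit(W1, [x1, x2])
    p2 = unit(W2, [x1, x2])
    p3 = unit(W3, [p1, p2])     # fires only when p1 and p2 are both 1
    return p1, p2, p3


# A1 is taken from Equation (57); the other three coordinates are hypothetical.
objects = {"A1": (0.2, 0.8), "A2": (0.8, 0.2), "B1": (0.2, 0.2), "B2": (0.8, 0.8)}
for name, (x1, x2) in objects.items():
    print(name, network(x1, x2))
```

Every weight here is set by hand, which is exactly the limitation discussed next: nothing in the binary outputs tells the first layer how to adjust itself.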

Although the layout in Figure 13 correctly classifies the data, by applying 
two linear discriminating functions to the pattern space, it is unable to learn 
from a training set and must be fully programmed before use, i.e. it must be 
set up manually before being employed. This situation arises because the 
second layer is not aware of the status of the original data, only the binary 
output from the first layer units. The simple on-off output from layer one 
provides no measure of the scaling required to adjust and correct the weights of 
its inputs. 

The way around the non-learning problem associated with this scheme 
provides the second change to the simple perceptron model, and involves 
altering the nature of the comparison operation by modifying the threshold 
function. In place of the Heaviside step function described previously, a 
smoother curve such as a linear or sigmoidal function is usually employed, 
Figure 14. The input and output for each perceptron unit or neuron with such a 
threshold function will no longer be limited to a value of zero or one, but can 
range between these extremes. Hence, the signal propagated through the system 
carries information which can be used to indicate how near an input is to the 
full threshold value; information which can be used to regulate signal reinforce- 
ment by changing the weight vectors. Thus, the multilayer system is now 
capable of learning. 

The basic learning mechanism for networks of multilayer neurons is the 
generalized delta rule, commonly referred to as back propagation. This learning 
rule is more complex than that employed with the simple perceptron unit 
because of the greater information content associated with the continuous 
output variable compared with the binary output of the perceptron. 

In general, a typical back-propagation network will comprise an input stage, 
with as many inputs as there are variables, an output layer, and at least one 
hidden layer, Figure 15. Each layer is fully connected to its succeeding layer. 
During training for supervised learning, the first pattern vector is presented to 
the input stage of the network and the output of the network will be unpredict- 
able. This process describes the forward pass of data through the network and, 
using a sigmoidal transfer function, is defined at each neuron by 


$$ O_j = \frac{1}{1 + e^{-I_j}} \qquad (60) $$

where 

$$ I_j = \sum_i w_{ij} O_i \qquad (61) $$


$O_j$ is the output from neuron $j$ and $I_j$ is the summed input to neuron $j$ 
from other neurons, $O_i$, modified according to the weight of the connection, 
$w_{ij}$, between the $i$'th and $j$'th neurons, Figure 16. 


Figure 14 Some commonly used threshold functions for neural networks: the Heaviside 
function (a), the linear function (b), and the sigmoidal function (c). Each panel 
plots f(net) between 0 and 1 against an input running from -6 to 6 


Figure 15 The general scheme for a fully connected two-layer neural network with four 
inputs (x1 to x4), showing the input layer, hidden layer and output layer 
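
As a small illustration of Equations (60) and (61), the following Python sketch evaluates one neuron with four inputs, as in Figure 16. The input values and weights are hypothetical numbers chosen only to show the calculation.

```python
import math


def neuron_output(inputs, weights):
    """Equations (60) and (61): sigmoid of the weighted sum of the inputs."""
    summed_input = sum(w * o for w, o in zip(weights, inputs))    # I_j
    return 1.0 / (1.0 + math.exp(-summed_input))                  # O_j


# Hypothetical outputs O_1..O_4 of a previous layer and weights w_15..w_45.
previous_outputs = [0.9, 0.1, 0.5, 0.7]
weights_to_5 = [0.8, -0.4, 0.3, -0.6]
print(neuron_output(previous_outputs, weights_to_5))
```

Unlike the step function, the output changes smoothly with the summed input, so a small change in any weight produces a small, measurable change in the output.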

The final output from the network for our input pattern is compared with the 
known, correct result and a measure of the error is computed. In order to 
reduce this error, the weight vectors between neurons are adjusted by using the 
generalized delta rule and back-propagating the error from one layer to the 
previous layer. 

The total error, E, is given by the difference between the correct or target 
output, t, and the actual measured output, O, i.e. 


$$ E = \sum_j (t_j - O_j)^2 \qquad (62) $$

and the critical parameter that is passed back through the layers of the network 
is defined by 


$$ \delta_j = -\frac{\partial E}{\partial I_j} \qquad (63) $$


For output units the observed results can be compared directly with the target 
result, and 


$$ \delta_j = f'_j(I_j)\,(t_j - O_j) \qquad (64) $$


where $f'_j$ is the first derivative of the sigmoid, threshold function. 
If unit $j$ is not an output unit, then, 


$$ \delta_j = f'_j(I_j) \sum_k \delta_k w_{jk} \qquad (65) $$


where the subscript $k$ refers to neurons in the layer handled just before unit 
$j$ in the backward pass, i.e. the layer that receives the output of unit $j$. 


Figure 16 Considering a single neuron, number 5, in a non-input layer of the network, 
each of the four inputs, $O_1 \ldots O_4$, is weighted by a coefficient, 
$w_{15} \ldots w_{45}$. The neuron's output, $O_5$, is the summed value, $I_5$, of 
the inputs modified by the threshold function, $f(I)$ 

Thus the error is calculated first in the output layer and is then passed back 
through the network to preceding layers for their weight vector to be adapted in 
order to reduce the error. A discussion of Equations (63) to (65) is provided by 
Beale and Jackson,’ and is derived by Zupan.? 
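
The backward pass of Equations (63) to (65) can be sketched for a small network as follows. This is only an illustration under stated assumptions: the 2-2-1 layout, the weights and the training pattern are hypothetical, and the error is taken as $E = \tfrac{1}{2}\sum_j (t_j - O_j)^2$ so that differentiating it gives Equation (64) without an extra factor of 2. The weight update itself is not specified in the text and is left out.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w_hidden, w_output):
    """Forward pass; x and the hidden outputs each end with a constant 1 (bias)."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    hidden.append(1.0)
    output = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w_output]
    return hidden, output


def total_error(x, target, w_hidden, w_output):
    """E = 1/2 * sum (t - O)^2; the factor 1/2 is an assumption of this sketch."""
    _, output = forward(x, w_hidden, w_output)
    return 0.5 * sum((t - o) ** 2 for t, o in zip(target, output))


def deltas(x, target, w_hidden, w_output):
    hidden, output = forward(x, w_hidden, w_output)
    # Equation (64); for the sigmoid, f'(I) = O(1 - O).
    d_out = [o * (1 - o) * (t - o) for t, o in zip(target, output)]
    # Equation (65): each hidden unit collects the deltas of the units it feeds.
    d_hid = [hidden[j] * (1 - hidden[j]) *
             sum(d_out[k] * w_output[k][j] for k in range(len(w_output)))
             for j in range(len(w_hidden))]
    return hidden, d_hid, d_out


# Hypothetical 2-2-1 network and a single training pattern.
x, target = [0.2, 0.8, 1.0], [1.0]
w_hidden = [[0.5, -0.3, 0.1], [-0.2, 0.4, 0.3]]
w_output = [[0.7, -0.6, 0.2]]
hidden, d_hid, d_out = deltas(x, target, w_hidden, w_output)
print(d_hid, d_out)
```

A finite-difference check is a convenient way to confirm such a routine: perturbing a weight on the connection from unit i to unit j by a small amount should change E by approximately −δ_j O_i times that amount.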

A suitable neural network can provide the functions of feature extraction and 
selection and classification. The network can adjust automatically the weights 
and threshold values of its neurons during a learning exercise with a training set 
of known and previously categorized data. It is this potential of neural net- 
works to provide a complete solution to pattern recognition problems that has 
generated the considerable interest in their use. One general problem in apply- 
ing neural networks relates to the design of the topology of the neural network 
for any specific problem. For anything other than the most trivial of tasks there 
may exist many possible solutions and designs which can provide the required 
classification, and formal rules of design and optimization are rarely employed 
or acknowledged. In addition, a complex network comprising many hundreds 
or thousands of neurons will be difficult, if not impossible, to analyse in terms 
of its internal behaviour. The performance of a neural network is usually 
judged by results, often with little attention paid to statistical tests or the 
stability of the system. 


Figure 17 Neural network configurations and their corresponding decision capabilities 
illustrated with the XOR problem of Figure 12 and a more complex overlapping 
2-group example. Rows: 1-layer, 2-layer, 3-layer; columns: structure, 
XOR-problem, meshed problem 
(Reproduced by permission from ref. 7) 


Figure 18 A neural network, comprising an input layer (I), a hidden layer (H), and an 
output layer (O). This is capable of correctly classifying the analytical data 
from Table 1. The required weighting coefficients are shown on each 
connection and the bias values for a sigmoidal threshold function are shown above 
each neuron 

As demonstrated previously, a single-layer perceptron can serve as a linear 
classifier by fitting a line or plane between the classes of objects, but it fails with 
non-linear problems. The two-layer device, however, is capable of combining 
the linear decision planes to solve such problems as that illustrated in Figure 12. 
Increasing the number of perceptrons or neuron units in the hidden layer 
increases proportionally the number of linear edges to the pattern shape 
capable of being classified. If a third layer of neurons is added then even more 
complex shapes may be identified. Arbitrarily complex shapes can be defined by 
a three-layer network and such a system is capable of separating any class of 
patterns. This general principle is illustrated in Figure 17.6 

For our two-wavelength spectral data, a two-layer network is adequate to 
achieve the desired separation. A suitable neural network, with the weight 
vectors, is illustrated in Figure 18. 


CHAPTER 6 


Calibration and Regression Analysis 


1 Introduction 


Calibration is one of the most important tasks in quantitative spectrochemical 
analysis. The subject continues to be extensively examined and discussed in the 
chemometrics literature as ever more complex chemical systems are studied. 
The computational procedures discussed in this chapter are concerned with 
describing quantitative relationships between two or more variables. In 
particular we are interested in studying how measured dependent or response 
variables vary as a function of a single so-called independent variable. The 
class of techniques studied is referred to as regression analysis. 

The principal aim in undertaking regression analysis is to develop a suitable 
mathematical model for descriptive or predictive purposes. The model can be 
used to confirm some idea or theory regarding the relationship between vari- 
ables or it can be used to predict some general, continuous response function 
from discrete and possibly relatively few measurements. 

The single most common application of regression analysis in analytical 
laboratories is undoubtedly curve-fitting and the construction of calibration 
lines from data obtained from instrumental methods of analysis. Such graphs, 
for example absorbance or emission intensity as a function of sample con- 
centration, are commonly assumed to be linear, although non-linear functions 
can also be used. The fitting of some ‘best’ straight line to analytical data 
provides us with the opportunity to examine the fundamental principles of 
regression analysis and the criteria for measuring ‘goodness of fit’. 

Not all relationships can be adequately described using the simple linear 
model, however, and more complex functions, such as quadratic and higher- 
order polynomial equations, may be required to fit the experimental data. 
Finally, more than one variable may be measured. For example, multiwave- 
length calibration procedures are finding increasing applications in analytical 
spectrometry and multivariate regression analysis forms the basis for many 
chemometric methods reported in the literature. 




2 Linear Regression 


It frequently occurs in analytical spectrometry that some characteristic, y, of a 
sample is to be determined as a function of some other quantity, x, and it is 
necessary to determine the relationship or function between x and y, which may 
be expressed as y = f(x). An example would be the calibration of an atomic 
absorption spectrometer for a specific element prior to the determination of the 
concentration of that element in a series of samples. 

A series of n absorbance measurements is made, $y_i$, one for each of a suitable 
range of known concentrations, $x_i$. The n pairs of measurements $(x_i, y_i)$ can be 
plotted as a scatter diagram to provide a visual representation of the relation- 
ship between x and y. 

In the determination of chromium and nickel in machine oil by atomic 
absorption spectrometry the calibration data presented in Table 1 were 
obtained. These experimental data are shown graphically in Figure 1. 

At low concentrations of analyte and working at low absorbance values, a 
linear relationship is to be expected between absorbance and concentration, as 
predicted by Beer’s Law. Visual inspection of Figure 1(a) for the chromium 
data confirms the correctness of this linear function and, in this case, it is a 
simple matter to draw by hand a satisfactory straight line through the data and 
use the plot for subsequent analyses. The equation of the line can be estimated 
directly from this plot. In this case there is little apparent experimental 
uncertainty. In many cases, however, the situation is not so clear-cut. Figure 
1(b) illustrates the scatter plot of the nickel data. It is not possible here to draw 
a straight line passing through all points even though a linear relationship 
between absorbance and concentration is still considered valid. The deviations 
in the absorbance data from expected, ideal values can be assumed to be due to 
experimental errors and uncertainties in the individual measurements and not 
due to some underlying error in the theoretical relationship. If multiple 
measurements of absorbance were made for each standard concentration, then 
a normal distribution for the absorbance values would be expected. These 
values would be centred on some mean absorbance value $\bar{y}_i$. The task for an 
analyst is to determine the ‘best’ straight line regressed through the means of 
the experimental data. 


Table 1 Absorbance data measured from standard solutions of chromium and 
nickel by AAS (a). Calculation of the best-fit, least-squares line for the 
nickel data (b) 


(a) 
Chromium concn. (mg kg⁻¹):   0       1       2       3       4       5       (x) 
Absorbance:                  0.01    0.11    0.21    0.29    0.38    0.52    (y) 
Nickel concn. (mg kg⁻¹):     0       1       2       3       4       5       (x) 
Absorbance:                  0.02    0.12    0.14    0.32    0.38    0.49    (y) 

(b) 
For nickel: x̄ = 2.50 and ȳ = 0.245 

                                                                             Sum 
(x_i − x̄):                  -2.50   -1.50   -0.50    0.50    1.50    2.50 
(y_i − ȳ):                  -0.225  -0.125  -0.105   0.075   0.135   0.245 
(x_i − x̄)(y_i − ȳ):          0.562   0.187   0.052   0.037   0.202   0.612   1.655 
(x_i − x̄)²:                  6.25    2.25    0.25    0.25    2.25    6.25    17.50 


b = 1.655/17.50 = 0.095 
a = 0.0075 


Figure 1 Calibration plots of chromium (a) and nickel (b) standard solutions, from data in 
Table 1. For chromium, a good fit can be drawn by eye. For nickel, however, a 
regression model should be derived, Table 1(b). Axes: absorbance against 
concentration, mg kg⁻¹ 

The data set comprises pairs of measurements of an independent variable x 
(concentration) and a dependent variable y (absorbance) and it is required to fit 
the data using a linear model with the well known form, 


$$ \hat{y}_i = a + b x_i \qquad (1) $$


where $a$ and $b$ are constant coefficients characteristic of the regression line, 
representing the intercept of the line with the y-axis and the gradient of the line 
respectively. The values of $\hat{y}_i$ represent the estimated, model values of 
absorbance derived from this linear model. The generally accepted requirement for 
deriving the best straight line between x and y is that the discrepancy between 
the measured data and the fitted line is minimized. The most popular technique 
employed to minimize this error between model and recorded data is the 
least-squares method. For each measured value, the deviation between the 
derived model value and the measured data is given by $\hat{y}_i - y_i$. 

The total error between the model and observed data is the sum of these 
individual errors. Each error value is squared to make all values positive and 
prevent negative and positive errors from cancelling. Thus the total error, 
$\varepsilon$, is given by, 


$$ \text{error, } \varepsilon = \sum_{i=1}^{n} (\hat{y}_i - y_i)^2 \qquad (2) $$


The total error is the sum of the squared deviations. For some model defined 
by coefficients a and b, this error will be a minimum and this minimum point 
can be determined using partial differential calculus. 

From Equations (1) and (2) we can substitute our model equation into the 
definition of error, 


$$ \varepsilon = \sum_{i=1}^{n} (a + b x_i - y_i)^2 \qquad (3) $$


The values of the coefficients a and b are our statistically independent 
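unknowns to be determined. By differentiating with respect to $a$ and $b$ 
respectively, then at the minimum, 


$$ \frac{\partial \varepsilon}{\partial a} = \sum_{i=1}^{n} 2(\hat{a} + \hat{b}x_i - y_i) = 0 $$

$$ \frac{\partial \varepsilon}{\partial b} = \sum_{i=1}^{n} 2x_i(\hat{a} + \hat{b}x_i - y_i) = 0 \qquad (4) $$


where $\hat{a}$ and $\hat{b}$ are least squares estimates of the intercept, $a$, and 
slope, $b$. 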
Expanding and rearranging Equations (4) provides the two simultaneous 
equations, 


$$ n\hat{a} + \hat{b}\sum x_i = \sum y_i $$
$$ \hat{a}\sum x_i + \hat{b}\sum x_i^2 = \sum (x_i y_i) \qquad (5) $$


from which the following expressions can be derived, 


$$ \hat{a} = \bar{y} - \hat{b}\bar{x} \qquad (6) $$

and 

$$ \hat{b} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2} \qquad (7) $$

where $\bar{x}$ and $\bar{y}$ represent the mean values of x and y.¹ 

For the experimental data for Ni, calculation of $\hat{a}$ and $\hat{b}$ is trivial 
($\hat{a}$ = 0.0075 and $\hat{b}$ = 0.095) and the fitted line passes through the 
central point given by $\bar{x}$, $\bar{y}$, Figure 1(b). 

Once values for $\hat{a}$ and $\hat{b}$ are derived, it is possible to deduce the 
concentration of subsequently analysed samples by recording their absorbances and 
substituting the values in Equation (1). It should be noted, however, that because the 
model is derived for concentration data in the range defined by $x_i$ it is important 
that subsequent predictions are also based on measurements in this range. The 
model should be used for interpolation only and not extrapolation. 
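
Equations (6) and (7) translate directly into code. The sketch below, with the hypothetical function names `fit_line` and `concentration`, applies them to the nickel data of Table 1 and then inverts Equation (1) for a measured absorbance. Carrying the unrounded slope into Equation (6) gives an intercept of about 0.0086; the value 0.0075 quoted above follows when the slope is first rounded to 0.095.

```python
def fit_line(x, y):
    """Least-squares intercept and slope from Equations (6) and (7)."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = s_xy / s_xx                # Equation (7)
    a = y_mean - b * x_mean        # Equation (6)
    return a, b


def concentration(absorbance, a, b):
    """Invert Equation (1) to estimate x from a measured y."""
    return (absorbance - a) / b


nickel_x = [0, 1, 2, 3, 4, 5]
nickel_y = [0.02, 0.12, 0.14, 0.32, 0.38, 0.49]
a, b = fit_line(nickel_x, nickel_y)
print(round(a, 4), round(b, 4))
print(round(concentration(0.25, a, b), 2))
```

The inversion is only meaningful for absorbances that fall within the range spanned by the standards, in line with the interpolation caveat above.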


Errors and Goodness of Fit 


It is often the case in chemical analysis that the independent variable, standard 
solution concentrations in the above example, is said to be fixed. The values of 
concentration for the calibration solutions can be expected to have been chosen 
by the analyst and the values to be accurately known. The errors associated 
with x, therefore, are negligible compared with the uncertainty in y due to 
fluctuations and noise in the instrumental measurement. 

To use Equations (6) and (7) in order to determine the characteristics of the 
fitted line, and employ this information for prediction, it is necessary to 
estimate the uncertainty in the calculated values for the slope, $\hat{b}$, and 
intercept, $\hat{a}$. Each of the absorbance values, $y_i$, has been used in the 
determination of $\hat{a}$ and $\hat{b}$ and each has contributed its own 
uncertainty or error to the final result. 

Estimates of error in the fitted line and estimates of confidence intervals may 
be made if three assumptions are valid, 


(a) the absorbance values are from parent populations normally distributed 
about the mean absorbance value, 

(b) the variance associated with absorbance is independent of absorbance, 
i.e. the data are homoscedastic, and, 

(c) the sample absorbance means lie on a straight line. 


These conditions are illustrated in Figure 2 which illustrates a theoretical 
regression line of such data on an independent variable. 

The deviation or residual for each of the absorbance values in the nickel data 
is given by $y_i - \hat{y}_i$, i.e. the observed values minus the calculated or 
predicted values according to the linear model. The sum of the squares of these 
deviations, Table 2, is the residual sum of squares, and is denoted as $SS_D$. The 
least squares estimate of the line can be shown to provide the best possible fit and no 
other line can be fitted that will produce a smaller sum of squares. 


$$ SS_D = \varepsilon = \sum (y_i - \hat{y}_i)^2 = 0.00599 \qquad (8) $$


The variance associated with these deviations will be given by this sum of 
squares divided by the number of degrees of freedom, 


$$ \sigma_D^2 = SS_D/(n - 2) = \sum (y_i - \hat{y}_i)^2/(n - 2) \qquad (9) $$


The denominator, $n - 2$, is the residual degrees of freedom derived from the 
sample size, $n$, minus the number of parameters estimated for the line, 
$\hat{a}$ and $\hat{b}$. 

The standard deviations or errors of the estimated intercept and slope values, 
denoted by $\sigma_a$ and $\sigma_b$ respectively, are defined by² 


$$ \sigma_a = \sigma_D \left[ \frac{\sum x_i^2}{n \sum (x_i - \bar{x})^2} \right]^{0.5} $$

$$ \sigma_b = \sigma_D \Big/ \left[ \sum (x_i - \bar{x})^2 \right]^{0.5} \qquad (10) $$

from which the confidence intervals, CI, can be obtained, 

$$ CI(a) = \hat{a} \pm t\sigma_a $$

and 

$$ CI(b) = \hat{b} \pm t\sigma_b \qquad (11) $$


where $t$ is the value of the two-tailed t-distribution with $(n - 2)$ degrees of 
freedom. Table 2 gives the results for CI(a) and CI(b) using 95% confidence 
intervals for the nickel absorbance data. 


Figure 2 A regression line through mean values of homoscedastic data 


1 C. Chatfield, ‘Statistics for Technology’, Chapman and Hall, London, UK, 1975. 
2 J.N. Miller, Analyst, 1991, 116, 3. 
Table 2 Errors and goodness of fit calculations associated with the linear 
regression model for nickel AAS data from Table 1 


Nickel concn. (mg kg⁻¹):    0       1       2       3       4       5 
Absorbance (measured):      0.02    0.12    0.14    0.32    0.38    0.49 
Absorbance (estimated):     0.007   0.102   0.197   0.292   0.387   0.482 


SS_D = 0.00599 
σ_D² = 0.0015         σ_a = 0.028          σ_b = 0.0093 
CI(a) = 0.0075 ± 0.078         CI(b) = 0.095 ± 0.026 


SS_T = 0.177          SS_R = 0.171         r² = 0.966 
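
The chain of Equations (8) to (11) can be followed in a short Python sketch. The function name `regression_errors` and the returned dictionary keys are hypothetical, and the t value is supplied by the caller rather than computed (2.776 is the two-tailed 95% value for 4 degrees of freedom). Run on the six nickel points as listed in Table 1, it gives an SS_D of about 0.0046, lower than the 0.00599 in Table 2, so its intervals come out somewhat narrower than the tabulated ones; the equations applied are the same.

```python
import math


def regression_errors(x, y, t_value):
    """Equations (6) to (11) for a straight-line fit.

    t_value: two-tailed t statistic for n - 2 degrees of freedom,
    looked up by the caller.
    """
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    a = y_mean - b * x_mean
    ss_d = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))    # Equation (8)
    sigma_d = math.sqrt(ss_d / (n - 2))                             # Equation (9)
    sigma_a = sigma_d * math.sqrt(sum(xi ** 2 for xi in x) / (n * s_xx))
    sigma_b = sigma_d / math.sqrt(s_xx)                             # Equation (10)
    return {"a": a, "b": b, "SS_D": ss_d, "sigma_D": sigma_d,
            "CI_a": t_value * sigma_a, "CI_b": t_value * sigma_b}   # Equation (11)


result = regression_errors([0, 1, 2, 3, 4, 5],
                           [0.02, 0.12, 0.14, 0.32, 0.38, 0.49], t_value=2.776)
for name, value in result.items():
    print(name, round(value, 5))
```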


How well the estimated straight line fits the experimental data сап be assessed 
by determining the coefficient of determination and the correlation coefficient. 

The total variation associated with the y values, $SS_T$, is given by the sum of 
the squared deviations of the observed y values from the mean y value, 


$$ SS_T = \sum (y_i - \bar{y})^2 \qquad (12) $$


This total variation comprises two components, that due to the residual or 
deviation sum of squares, $SS_D$, and that from the sum of squares due to 
regression, $SS_R$: 


$$ SS_T = SS_D + SS_R \qquad (13) $$


$SS_D$ is a measure of the failure of the regressed line to fit the data points, and 
$SS_R$ provides a measure of the variation in the regression line about the mean 
values. 

The ratio of $SS_R$ to $SS_T$ indicates how well the model straight line fits the 
experimental data. It is referred to as the coefficient of determination and its 
value varies between zero and one. From Equation (13), if $SS_D = 0$ (the fitted 
line passes through each datum point) the total variation in y is explained by the 
regression line and $SS_T = SS_R$ and the ratio is one. On the other hand, if the 
regressed line fails completely to fit the data, $SS_R$ is zero, the total error is 
dominated by the residuals, i.e. $SS_T \approx SS_D$, then the ratio is zero and no 
linear relationship is present in the data. 

The coefficient of determination is denoted by $r^2$, 


$$ r^2 = SS_R/SS_T \qquad (14) $$

and $r^2$ is the square of the correlation coefficient, $r$, introduced in Chapter 1. 


From our data of measured absorbance vs. nickel concentration, $r^2$ = 0.966, 
indicating a good fit between the linear model and the experimental data. As 
discussed in Chapter 1, however, care must be taken not to rely too much on 
high values of $r^2$ or $r$ as indicators of linear trends. The data should be plotted 
and examined visually. 
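
The partition of Equation (13) and the ratio of Equation (14) can be verified numerically. The following is a minimal sketch with a hypothetical function name; it works from the nickel data as listed in Table 1 with an unrounded slope and intercept, so its output is not adjusted to the rounded entries of Table 2.

```python
def coefficient_of_determination(x, y):
    """Return SS_T, SS_D, SS_R and r^2 for a least-squares straight line."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    a = y_mean - b * x_mean
    fitted = [a + b * xi for xi in x]
    ss_t = sum((yi - y_mean) ** 2 for yi in y)                  # Equation (12)
    ss_d = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))     # residual part
    ss_r = sum((fi - y_mean) ** 2 for fi in fitted)             # regression part
    return ss_t, ss_d, ss_r, ss_r / ss_t                        # Equation (14)


ss_t, ss_d, ss_r, r2 = coefficient_of_determination(
    [0, 1, 2, 3, 4, 5], [0.02, 0.12, 0.14, 0.32, 0.38, 0.49])
print(round(ss_t, 4), round(ss_d, 4), round(ss_r, 4), round(r2, 3))
```

Computing SS_R independently from the fitted values, rather than as SS_T minus SS_D, gives a direct check that the two components add up to the total.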

In quantitative spectroscopic analysis an important parameter is the estimate 
of the confidence interval of a concentration value of an unknown sample, $x_u$, 
derived from a measured instrument response. This is discussed in detail by 
Miller² and can be obtained from the standard deviation associated with $x_u$, 


$$ \sigma_{x_u} = \frac{\sigma_D}{\hat{b}} \left[ \frac{1}{m} + \frac{1}{n} + \frac{(\bar{y}_u - \bar{y})^2}{\hat{b}^2 \sum (x_i - \bar{x})^2} \right]^{0.5} \qquad (15) $$

where $\bar{y}_u$ is the mean absorbance of the unknown sample from $m$ measurements. 
Thus, from a sample having a mean measured absorbance of 0.25 (from five 
observations), 


$$ \sigma_{x_u} = 0.248 \qquad (16) $$

and the 95% confidence limits of $x_u$ are 


$$ CI(x_u) = 2.55 \pm 0.69 \qquad (17) $$
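
A sketch of this calculation in Python is given below; the function name `unknown_concentration` is hypothetical and the t value (2.776 for 95% and 4 degrees of freedom) is supplied by the caller. Because it derives the residual standard deviation from the six nickel points as listed in Table 1, which comes out smaller than the tabulated one, it returns a standard deviation of about 0.22 rather than 0.248; the estimated concentration itself agrees at 2.55.

```python
import math


def unknown_concentration(x, y, y_unknown_mean, m, t_value):
    """Estimate x_u, its standard deviation and the confidence half-width."""
    n = len(x)
    x_mean, y_mean = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    b = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / s_xx
    a = y_mean - b * x_mean
    ss_d = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sigma_d = math.sqrt(ss_d / (n - 2))
    x_u = (y_unknown_mean - a) / b
    sigma_xu = (sigma_d / b) * math.sqrt(
        1 / m + 1 / n + (y_unknown_mean - y_mean) ** 2 / (b ** 2 * s_xx))
    return x_u, sigma_xu, t_value * sigma_xu


x_u, sigma_xu, half_width = unknown_concentration(
    [0, 1, 2, 3, 4, 5], [0.02, 0.12, 0.14, 0.32, 0.38, 0.49],
    y_unknown_mean=0.25, m=5, t_value=2.776)
print(round(x_u, 2), round(sigma_xu, 3), round(half_width, 2))
```

The 1/m term shows why replicate readings of the unknown narrow the interval, while the last term grows as the unknown moves away from the centre of the calibration range.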


Regression through the Origin 


Before leaving linear regression, a special case often encountered in laboratory 
calibrations should be considered. A calibration is often performed using not 
only standard samples containing known amounts of the analyte but also a 
blank sample containing no analyte. The measured response for this blank 
sample may be subtracted from the response values for each standard sample 
and the fitted line assumed, and forced, to pass through the origin of the graph. 

Under such conditions the estimated regression line, Equation (1), reduces to

$$ \hat{y}_i = b x_i $$

and

$$ \varepsilon = \sum (b x_i - y_i)^2 \qquad (18) $$

The resulting equation for b, following partial differentiation, is

$$ b = \sum (x_i y_i) / \sum x_i^2 \qquad (19) $$


The option to use this model is often available in statistical computer
packages, and for manual calculations the arithmetic is reduced compared with
the full linear regression discussed above. A caveat should be made, however,
since forcing the line through the origin assumes that the measured blank value
is free from experimental error and that it represents accurately the true, mean
blank value.
For the nickel data from Table 1, using Equation (19), b = 0.094, and the sum
of squares of the deviations, $SS_D$, is 0.00614. This value is greater than the
computed value of $SS_D$ for the model using data not corrected for the blank,
indicating the poorer performance of the model of Equation (18).
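
The comparison between the two models can be sketched numerically. The short
Python example below is an illustration only: the calibration values are
hypothetical (they are not the nickel data of Table 1), and the variable names
are assumptions introduced here. It applies Equation (19) and compares the sum
of squared deviations with that of the full linear model.

```python
import numpy as np

# Hypothetical blank-corrected calibration data (NOT the nickel data of Table 1).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # analyte concentration
y = np.array([0.10, 0.19, 0.31, 0.39, 0.52])   # blank-corrected response

# Equation (19): slope of the line forced through the origin.
b_origin = np.sum(x * y) / np.sum(x ** 2)

# Equation (18): sum of squared deviations for the zero-intercept model.
ss_d_origin = np.sum((b_origin * x - y) ** 2)

# Full linear model y = a + b x for comparison (polyfit returns the slope first).
b_full, a_full = np.polyfit(x, y, 1)
ss_d_full = np.sum((a_full + b_full * x - y) ** 2)

print(f"through origin: b = {b_origin:.4f}, SS_D = {ss_d_origin:.5f}")
print(f"with intercept: a = {a_full:.4f}, b = {b_full:.4f}, SS_D = {ss_d_full:.5f}")
```

Because the full model has one more adjustable parameter, its sum of squared
deviations can never exceed that of the line forced through the origin when
both are fitted to the same data.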


3 Polynomial Regression 


Although the linear model is the model most commonly encountered in analytical
science, not all relationships between a pair of variables can be adequately
described by linear regression. A calibration curve does not have to approximate
a straight line to be of practical value. The use of higher-order equations
to model the association between dependent and independent variables may be
more appropriate. The most popular way to model non-linear data and
include curvature in the graph is to fit a power-series polynomial of the form

$$ y = a + bx + cx^2 + dx^3 + \ldots \qquad (20) $$

where, as before, y is the dependent variable, x is the independent variable on
which y is regressed, and a, b, c, d, etc. are the coefficients associated with
each power term of x.

The method of least squares was employed in the previous section to fit the
best straight line to analytical data and a similar procedure can be adopted to
estimate the best polynomial line. To illustrate the technique, the least squares
fit for a quadratic curve will be developed. This can be readily extended to
higher power functions.³

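The quadratic function is given by

$$ y = a + bx + cx^2 \qquad (21) $$

and the following simultaneous equations can be derived:

$$ an + b \sum x + c \sum x^2 = \sum y $$
$$ a \sum x + b \sum x^2 + c \sum x^3 = \sum yx \qquad (22) $$
$$ a \sum x^2 + b \sum x^3 + c \sum x^4 = \sum yx^2 $$

and in matrix notation,

$$ \begin{bmatrix} n & \sum x & \sum x^2 \\ \sum x & \sum x^2 & \sum x^3 \\ \sum x^2 & \sum x^3 & \sum x^4 \end{bmatrix} \begin{bmatrix} a \\ b \\ c \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum yx \\ \sum yx^2 \end{bmatrix} \qquad (23) $$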


which can be solved for coefficients a, b, and c. 
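
As an illustration of Equation (23), the sketch below builds the matrix of sums
and solves for a, b, and c with NumPy. It assumes the quinine fluorescence
readings tabulated later in this section (Table 3), with the final intensity
taken as 500, the value consistent with the mean intensity of 306.7 and the
sums of squares quoted for these data. The variable names are choices made for
this example.

```python
import numpy as np

# Quinine concentration (mg/kg) and fluorescence intensity (arbitrary units).
# Assumption: the reading at 25 mg/kg is taken as 500.
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([10, 180, 300, 390, 460, 500], dtype=float)

n = len(x)

# Left-hand matrix of Equation (23): sums of powers of x.
A = np.array([
    [n,               x.sum(),         (x ** 2).sum()],
    [x.sum(),         (x ** 2).sum(),  (x ** 3).sum()],
    [(x ** 2).sum(),  (x ** 3).sum(),  (x ** 4).sum()],
])

# Right-hand vector of Equation (23).
rhs = np.array([y.sum(), (y * x).sum(), (y * x ** 2).sum()])

a, b, c = np.linalg.solve(A, rhs)
print(f"y = {a:.2f} + {b:.2f} x + {c:.3f} x^2")
```

For polynomials of higher degree the sums of powers grow rapidly and the matrix
becomes poorly conditioned, so in practice a routine such as `numpy.polyfit` or
`numpy.linalg.lstsq` is usually a safer choice than forming the sums explicitly.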
The extension of the technique to higher order polynomials, e.g. cubic,
quartic, etc., is straightforward. Consider the general mth degree polynomial


3 A.F. Carley and P.H. Morgan, 'Computational Methods in the Chemical Sciences', Ellis
Horwood, Chichester, UK, 1989.

This expands to (m + 1) simultaneous equations from which (m + 1) coefficients
are to be determined. The terms on the right-hand side of the matrix
equation will range from $\sum y_i$ to $\sum (x_i^m y_i)$ and on the left-hand side
from $\sum 1$ to $\sum x_i^{2m}$.

A serious problem encountered with the application of polynomial curve- 
fitting is the fundamental decision as to which degree of polynomial is best. 
Visual inspection of the experimental data may indicate that a straight line is 
not appropriate. It may not be immediately apparent, unless theory dictates 
otherwise, whether say, a quadratic or cubic equation should be employed to 
model the data. As the number of terms in the polynomial is increased, the 
correlation between the experimental data and the fitted curve will also 
increase. In the limit, when the number of terms is one less than the number of 
the data points the correlation will be unity, i.e. the curve will pass through 
every point. Such a polynomial, however, may have no physical significance. In 
practice, statistical tests, based on the use of the F-ratio, can be employed to 
examine the significance of terms added to a polynomial and to indicate 
whether observed increases in the correlation coefficient are statistically sig- 
nificant. 

Table 3 shows results of recorded fluorescence emission intensity as a func- 
tion of concentration of quinine sulphate in acidic solutions. These data are 
plotted in Figure 3 with regression lines calculated from least squares estimated 
lines for a linear model, a quadratic model and a cubic model. The correlation 
for each fitted model with the experimental data is also given. It is obvious by 
visual inspection that the straight line represents a poor estimate of the associ- 
ation between the data despite the apparently high value of the correlation 
coefficient. The observed lack of fit may be due to random errors in the 
measured dependent variable or due to the incorrect use of a linear model. The 
latter is the more likely cause of error in the present case. This is confirmed by 
examining the differences between the model values and the actual results, 
Figure 4. With the linear model, the residuals exhibit a distinct pattern as a 
function of concentration. They are not randomly distributed as would be the 
case if a more appropriate model was employed, e.g. the quadratic function. 

The linear model predicts the relationship between fluorescence intensity, I,
and analyte concentration, x, to be of the form,

$$ I_i = a + b x_i + \varepsilon_i \qquad (25) $$

where $\varepsilon$ is a random error, assumed to be normally distributed, with a
variance, $\sigma^2$, independent of the value of I. If these assumptions are valid
and Equation (25) is a true model of the experimental data then the variance of
$\varepsilon$ will be equal to the variance about the regression line. If the model
is incorrect, then the variance around the regression will exceed the variance of
$\varepsilon$. These variances can be estimated using ANOVA and the F-ratio
calculated to compare the variances and test the significance of the model.


Table 3 Measured fluorescence emission intensity as a function of quinine
concentration

Quinine concn. (mg kg⁻¹):               0     5    10    15    20    25
Fluorescence intensity (arb. units):   10   180   300   390   460   500


Figure 3 The linear (a), quadratic (b), and cubic (c) regression lines for the fluorescence
data from Table 3

Figure 4 Residuals ($y_i - \hat{y}_i$) as a function of concentration (x) for best fit linear and
quadratic models

The form of the ANOVA table for multiple regression is shown in Table 4. 
The completed table for the linear model fitted to the fluorescence data is given 
in Table 5. This analysis of variance serves to test whether a regression line is 
helpful in predicting the values of intensity from concentration data. For the 
linear model we wish to test whether the line of slope b adds a significant 
contribution to the zero-order model. The null hypothesis being tested is, 


$$ H_0: b = 0 \qquad (26) $$

i.e. the mean intensity value is as accurate in predicting emission intensity
as the linear regression line. When the fitted line differs significantly from a
horizontal (b = 0) line, then the term $\sum (\hat{I}_i - \bar{I})^2$ will be large
relative to the residuals from the line, $\sum (I_i - \hat{I}_i)^2$. As expected,
this in fact is the case for the linear model, $F_{1,4} = 74.8$, compared with
$F_{1,4} = 7.71$ from tables for a 5% level of significance. So the null hypothesis
is rejected, the linear regression model is significant, and the degree to which
the regression equation fits the data can be evaluated from the coefficient of
determination, $r^2$, given by Equation (14).


Table 4 ANOVA table for multiple regression

Source of     Sum of Squares     Degrees of       Mean Squares           F-ratio
variation     (SS)               freedom (df)     (MS)

Regression    Σ(ŷᵢ − ȳ)²         p                SS_reg/p               MS_reg/MS_res
Residuals     Σ(yᵢ − ŷᵢ)²        n − p − 1        SS_res/(n − p − 1)
(error)
Total         Σ(yᵢ − ȳ)²         n − 1            SS_tot/(n − 1)


Table 5 ANOVA table for the linear regression model applied to the fluorescence
data, emission intensity as a function of concentration

Source of     Sum of      df    Mean        F-ratio
variation     Squares           Squares

Regression    163 206     1     163 206     74.8
Residuals       8 728     4       2 182
Total         171 934     5      34 388

$I = 65.24 + 19.31x$
$r^2 = 0.949$


Table 6 ANOVA table for the quadratic regression model of fluorescence intensity
as a function of concentration

Source of     Sum of      df    Mean        F-ratio
variation     Squares           Squares

Regression    171 807     2      85 903     2038
Residuals         126     3          42
Total         171 934     5      34 387

$I = 14.64 + 34.49x - 0.61x^2$
$r^2 = 0.999$
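
The entries of the ANOVA table for the linear model can be reproduced with a few
lines of NumPy. The sketch below is illustrative: the function name
`regression_anova` and its return format are assumptions made for this example,
and the data are those of Table 3 with the final intensity taken as 500, the
value consistent with the totals quoted in the tables.

```python
import numpy as np

def regression_anova(x, y, degree):
    """Return the ANOVA quantities for a polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)        # least-squares coefficients
    y_hat = np.polyval(coeffs, x)            # fitted values
    n, p = len(y), degree                    # p = number of regression terms
    ss_reg = np.sum((y_hat - y.mean()) ** 2)
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    ms_reg = ss_reg / p
    ms_res = ss_res / (n - p - 1)
    return {
        "SS_reg": ss_reg, "SS_res": ss_res, "SS_tot": ss_tot,
        "df_reg": p, "df_res": n - p - 1,
        "F": ms_reg / ms_res,
        "r2": ss_reg / ss_tot,
    }

# Table 3 data; the reading at 25 mg/kg is assumed to be 500.
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([10, 180, 300, 390, 460, 500], dtype=float)

linear = regression_anova(x, y, degree=1)
for key, value in linear.items():
    print(f"{key:7s} {value:12.3f}")
```

The F-ratio returned for the linear model is then compared with the tabulated
critical value for (1, 4) degrees of freedom, 7.71 at the 5% level.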

A similar ANOVA table can be completed for the quadratic model, Table 6.
Does the addition of a quadratic term contribute significantly to the first-order,
linear model? The equation tested is now

$$ I_i = a + b x_i + c x_i^2 + \varepsilon_i \qquad (27) $$

and the null hypothesis is

$$ H_0: b = c = 0 \qquad (28) $$


Once again the high value of the F-ratio indicates the model is significant as a
predictor. This analysis can now be taken a step further since the sum of the
squares associated with the regression line can be attributed to two components,
the linear function and the quadratic function. This analysis is accomplished
by the decomposition of the sum of squares, Table 7. The total sum of
squares value for the regression can be obtained from Table 6 and that due to
the linear component, x, from Table 5. The difference is attributed to the
quadratic term. The large F-value indicates the high significance of each term.


Table 7 Sum of Squares decomposition for the quadratic model

Source of     Sum of      df    Mean        F-ratio
variation     Squares           Squares

x             163 206     1     163 206     3873
x²              8 601     1       8 601      204
Total         171 807     2      85 903


Table 8 ANOVA table for the cubic regression model of fluorescence intensity as
a function of concentration

Source of     Sum of      df    Mean        F-ratio
variation     Squares           Squares

Regression    171 901     3      57 300     3522
Residuals          32     2          16
Total         171 933     5      34 387

$I = 11.03 + 37.8x - 0.97x^2 + 0.0096x^3$
$r^2 = 0.999$


Table 9 Sum of Squares decomposition for the cubic model

Source of     Sum of      df    Mean        F-ratio
variation     Squares           Squares

x             163 206     1     163 206     10031
x²              8 601     1       8 601       529
x³                 94     1          94       5.8
Total         171 901     3      57 300

The exercise can be repeated for the fitted cubic model, and the ANOVA 
table and sums of squares decomposition are shown in Tables 8 and 9 
respectively. In this case, the F-statistic for the cubic term (F = 5.8) is not 
significant at the 5% level. The cubic term is not required and we can conclude 
that the quadratic model is sufficient to describe the analytical data accurately, 
a result which agrees with visual inspection of the line, Figure 3(b). 
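
The sums of squares decomposition can be sketched as a sequence of fits of
increasing degree, each additional term being credited with the reduction in the
residual sum of squares it produces. The example below is an illustration under
stated assumptions: it uses the Table 3 data with the final intensity taken as
500, tests every term against the residual mean square of the highest-degree
model, and the names `residual_ss`, `extra_ss`, and `f_ratio` are introduced
here.

```python
import numpy as np

# Table 3 data; the reading at 25 mg/kg is assumed to be 500.
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([10, 180, 300, 390, 460, 500], dtype=float)
n = len(y)

def residual_ss(degree):
    """Residual sum of squares for a least-squares polynomial of this degree."""
    coeffs = np.polyfit(x, y, degree)
    return np.sum((y - np.polyval(coeffs, x)) ** 2)

max_degree = 3
rss = [residual_ss(d) for d in range(max_degree + 1)]   # degree 0 is the mean only

# Residual mean square of the highest-degree (cubic) model.
ms_res = rss[max_degree] / (n - max_degree - 1)

extra_ss = []   # sum of squares attributed to x, x^2, x^3 in turn
f_ratio = []    # each with 1 degree of freedom
for k in range(1, max_degree + 1):
    ss_k = rss[k - 1] - rss[k]
    extra_ss.append(ss_k)
    f_ratio.append(ss_k / ms_res)
    print(f"x^{k}: SS = {ss_k:10.1f}   F = {ss_k / ms_res:8.1f}")

# Tabulated critical value of F for (1, 2) degrees of freedom at the 5% level.
F_CRIT_1_2 = 18.51
print("cubic term significant:", f_ratio[2] > F_CRIT_1_2)
```

A term whose F-ratio falls below the critical value, as the cubic term does
here, adds nothing significant to the model at that level of confidence.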

In summary the three models tested are

$$ I = 65.24 + 19.31x $$
$$ I = 14.64 + 34.49x - 0.61x^2 \qquad (29) $$
$$ I = 11.03 + 37.80x - 0.97x^2 + 0.0096x^3 $$


The relative effectiveness and importance of the variables can be estimated
from the relative magnitudes of the regression coefficients. This cannot be done
directly on these coefficients, however, as their magnitudes are dependent on
the magnitudes of the variables themselves. In Equation (29), for example, the
coefficient for the cubic term is small compared with those for the linear and
quadratic terms, but the cubic term itself may be very large. Instead, the
standardized regression coefficients, $B_k$, are employed. These are determined by

$$ B_k = b_k \sigma_k / \sigma_y \qquad (30) $$

where $\sigma_k$ is the standard deviation of the variable $x_k$ and $\sigma_y$ is
the standard deviation of the dependent variable, y.
For the cubic model,

$$ B_1 = b \sigma_x / \sigma_y = 37.8 (9.35/185.4) = 1.91 $$
$$ B_2 = c \sigma_{x^2} / \sigma_y = -0.97 (243.6/185.4) = -1.27 \qquad (31) $$
$$ B_3 = d \sigma_{x^3} / \sigma_y = 0.0096 (6143.5/185.4) = 0.32 $$

As expected, the magnitude of the standardized regression coefficient $B_3$
is considerably less than those of the standardized linear and quadratic
coefficients, $B_1$ and $B_2$.
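
Equation (30) is straightforward to apply in code. The sketch below is a
minimal illustration, assuming the Table 3 data with the final intensity taken
as 500 and sample standard deviations (divisor n − 1); the dictionary names are
choices made for this example.

```python
import numpy as np

# Table 3 data; the reading at 25 mg/kg is assumed to be 500.
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([10, 180, 300, 390, 460, 500], dtype=float)

# Ordinary cubic fit; polyfit returns the highest power first.
d, c, b, a = np.polyfit(x, y, 3)

sigma_y = y.std(ddof=1)                         # sample standard deviation of y
terms = {"x": x, "x^2": x ** 2, "x^3": x ** 3}  # the three 'variables'
coefs = {"x": b, "x^2": c, "x^3": d}

# Equation (30): B_k = b_k * sigma_k / sigma_y
B = {name: coefs[name] * terms[name].std(ddof=1) / sigma_y for name in terms}

for name, value in B.items():
    print(f"B({name}) = {value:6.2f}")
```

The standardized coefficients are dimensionless, which is what allows the
contributions of x, x², and x³ to be compared directly.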


Orthogonal Polynomials 


In the previous section, the fluorescence emission data were modelled using 
linear, quadratic, and cubic equations and the quadratic form was determined 
as providing the most appropriate model. Despite this, on moving to the higher, 
cubic, polynomial the coefficient of the cubic term is not zero and the values for 
the regression coefficients are considerably different from those obtained for 
the quadratic equation. In general, the least squares polynomial fitting pro- 
cedure will yield values for the coefficients which are dependent on the degree of 
the polynomial model. This is one of the reasons why the use of polynomial 
curve fitting often contributes little to understanding the causal relationship 
between independent and dependent variables, despite the technique providing 
a useful curve fitting procedure. 

With the general polynomial equation discussed above, the value of the first
coefficient, a, represents the intercept of the line with the y-axis. The b
coefficient is the slope of the line at this point, and subsequent coefficients
are the values of higher orders of curvature. A more physically significant model
might be achieved by modelling the experimental data with a special polynomial
equation; a model in which the coefficients are not dependent on the specific
order of equation used. One such series of equations having this property of
independence of coefficients is that referred to as orthogonal polynomials.

Bevington⁴ presents the general orthogonal polynomial between variables y
and x in the form

$$ y = a + b(x - \beta) + c(x - \gamma_1)(x - \gamma_2) + d(x - \delta_1)(x - \delta_2)(x - \delta_3) + \ldots \qquad (32) $$

As usual, the least squares procedure is employed to determine the values of
the regression coefficients a, b, c, d, etc., giving the minimum deviation between
the observed data and the model. Also, we impose the criterion that subsequent
addition of higher-order terms to the polynomial will not change the value of
the coefficients of lower-order terms. This extra constraint is used to evaluate
the parameters $\beta$, $\gamma_1$, $\gamma_2$, $\delta_1$, etc. The coefficient a
represents the average y value, b the average slope, c the average curvature, etc.


4 P.R. Bevington, 'Data Reduction and Error Analysis in the Physical Sciences', McGraw-Hill,
New York, USA, 1969.

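In general, the computation of orthogonal polynomials is laborious but
the arithmetic can be greatly simplified if the values of the independent variable,
x, are equally spaced and the dependent variable is homoscedastic.⁴ In this
case,

$$ \beta = \bar{x} $$
$$ \gamma_{1,2} = \beta \pm \Delta \left[ \frac{n^2 - 1}{12} \right]^{0.5} \qquad (33) $$
$$ \delta_1 = \beta, \quad \delta_{2,3} = \beta \pm \Delta \left[ \frac{3n^2 - 7}{20} \right]^{0.5} $$

where

$$ \Delta = x_{i+1} - x_i \qquad (34) $$

and

$$ a = \bar{y} $$
$$ b = \frac{\sum y_i (x_i - \beta)}{\sum (x_i - \beta)^2} $$
$$ c = \frac{\sum y_i (x_i - \gamma_1)(x_i - \gamma_2)}{\sum [(x_i - \gamma_1)(x_i - \gamma_2)]^2} \qquad (35) $$
$$ d = \frac{\sum y_i (x_i - \delta_1)(x_i - \delta_2)(x_i - \delta_3)}{\sum [(x_i - \delta_1)(x_i - \delta_2)(x_i - \delta_3)]^2} $$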

Orthogonal polynomials are particularly useful when the order of the equation
is not known beforehand. The problem of finding the lowest-order polynomial
to represent the data adequately can be solved by first fitting a
straight line, then a quadratic curve, then a cubic, and so on. At each stage it is
only necessary to determine one additional parameter and apply the F-test to
estimate the significance of each additional term.

For the fluorescence emission data,

$$ \beta = 12.5 $$
$$ \gamma_1 = 21.04, \quad \gamma_2 = 3.96 \qquad (36) $$
$$ \delta_1 = 12.5, \quad \delta_2 = 23.74, \quad \delta_3 = 1.26 $$

and, from Equations (35),

$$ a = 306.7 $$
$$ b = 19.31 $$
$$ c = -0.608 \qquad (37) $$
$$ d = 0.0089 $$

Thus the orthogonal linear equation is given by

$$ I_i = 306.7 + 19.31(x_i - 12.5) $$

the quadratic by

$$ I_i = 306.7 + 19.31(x_i - 12.5) - 0.608(x_i - 21.04)(x_i - 3.96) $$

and the cubic model by

$$ I_i = 306.7 + 19.31(x_i - 12.5) - 0.608(x_i - 21.04)(x_i - 3.96) + 0.0089(x_i - 12.5)(x_i - 23.74)(x_i - 1.26) \qquad (38) $$


Figure 5 Orthogonal linear, quadratic, and cubic models for the fluorescence intensity
data from Table 3

These equations are illustrated graphically in Figure 5. As before, an 
ANOVA table can be constructed for each model and the significance of each 
term estimated by sums of squares decomposition and comparison of standard 
regression coefficients. 
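
For equally spaced x values the orthogonal polynomial parameters follow directly
from the number of points and their spacing. The sketch below is an
illustration, not a general routine: it assumes the closed forms given by
Bevington for equally spaced data, uses the Table 3 readings with the final
intensity taken as 500, and the variable names are choices made here.

```python
import numpy as np

# Table 3 data; the reading at 25 mg/kg is assumed to be 500.
x = np.array([0, 5, 10, 15, 20, 25], dtype=float)
y = np.array([10, 180, 300, 390, 460, 500], dtype=float)

n = len(x)
spacing = x[1] - x[0]                 # the constant interval between x values

# Parameters of the orthogonal terms for equally spaced x.
beta = x.mean()
g = spacing * np.sqrt((n ** 2 - 1) / 12)
gamma1, gamma2 = beta + g, beta - g
h = spacing * np.sqrt((3 * n ** 2 - 7) / 20)
delta1, delta2, delta3 = beta, beta + h, beta - h

# The orthogonal terms evaluated at each x.
p1 = x - beta
p2 = (x - gamma1) * (x - gamma2)
p3 = (x - delta1) * (x - delta2) * (x - delta3)

# Each coefficient is computed independently of the others.
a = y.mean()
b = np.sum(y * p1) / np.sum(p1 ** 2)
c = np.sum(y * p2) / np.sum(p2 ** 2)
d = np.sum(y * p3) / np.sum(p3 ** 2)

print(f"beta = {beta:.2f}, gamma = {gamma1:.2f}, {gamma2:.2f}, "
      f"delta = {delta1:.2f}, {delta2:.2f}, {delta3:.2f}")
print(f"a = {a:.1f}, b = {b:.2f}, c = {c:.3f}, d = {d:.4f}")
```

Because the terms are mutually orthogonal over the data, adding the quadratic
or cubic term leaves a and b unchanged. The leading (cubic) coefficient
obtained this way is the same as the coefficient of x³ in an ordinary cubic
least-squares fit of the same data.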


4 Multivariate Regression 


To this point, the discussion of regression analysis and its applications has
been limited to modelling the association between a dependent variable and a
single independent variable. Chemometrics is more often concerned with
multivariate measures. Thus it is necessary to extend our account of regression to
include cases in which several or many independent variables contribute to the
measured response. It is important to realize at the outset that the term
independent variables as used here does not imply statistical independence, as
the x variables may be highly correlated.

In the simplest example, the dependent response variable, y, may be a
function of two such independent variables, $x_1$ and $x_2$.

$$ y = a + b_1 x_1 + b_2 x_2 \qquad (39) $$

Again a is the intercept on the ordinate y-axis, and $b_1$ and $b_2$ are the partial
regression coefficients. These coefficients denote the rate of change of the mean
of y as a function of $x_1$, with $x_2$ constant, and the rate of change of y as a
function of $x_2$ with $x_1$ constant.

Multivariate regression analysis plays an important role in modern process 
control analysis, particularly for quantitative UV-visible absorption spec- 
trometry and near-IR reflectance analysis. It is common practice with these 
techniques to monitor absorbance, or reflectance, at several wavelengths and 
relate these individual measures to the concentration of some analyte. The 
results from a simple two-wavelength experiment serve to illustrate the details 
of multivariate regression and its application to multivariate calibration pro- 
cedures. 

Figure 6 presents a UV spectrum of the amino acid tryptophan. For quantitative
analysis, measurements at a single wavelength, e.g. $\lambda_{14}$, would be
adequate if no interfering species are present. In the presence of other absorbing
species, however, more measurements are needed. In Table 10 are presented the
concentrations and measured absorbance values at $\lambda_{14}$ of seven standard
solutions containing known amounts of tryptophan along with three samples
which we will assume contain unknown amounts of tryptophan. All solutions
have unknown concentrations of a second absorbing species present, in this
case the amino acid tyrosine. The effect of this interferent is to add noise and
distort the univariate calibration graph, as shown in Figure 7. The best-fit linear
regression line is also shown, as derived from

$$ \text{Concentration tryptophan, } Tr = -3.00 + 38.51 A_{14} \qquad (40) $$

where $A_{14}$ is the absorbance at $\lambda_{14}$.


Figure 6 The UV spectra, recorded at discrete wavelengths, of tryptophan and tyrosine

Figure 7 The least squares linear model of absorbance ($A_{14}$) vs. concentration of
tryptophan, data from Table 10


Table 10 Absorbance values of tryptophan standard solutions recorded at two
wavelengths, $\lambda_{14}$ and $\lambda_{21}$. Three 'unknown' test solutions, X1, X2, and
X3, are included with their true tryptophan concentration shown in
parentheses

Tryptophan concn.     Absorbance     Absorbance
(mg kg⁻¹)             A14            A21

 0                    0.0356         0.0390
 5                    0.3068         0.2110
10                    0.3980         0.1860
15                    0.3860         0.0450
20                    0.6020         0.1580
25                    0.6680         0.1070
29                    0.8470         0.2010
X1 (7)                0.3440         0.2010
X2 (14)               0.3670         0.0500
X3 (27)               0.8100         0.2110


Despite the apparently high $r^2$ value for this model ($r^2 = 0.943$), its predictive
ability is poor as can be demonstrated with the three test samples:

Actual:         7        14        27       mg kg⁻¹
Predicted:     10.26     11.14     28.20    mg kg⁻¹

If a second term, say the absorbance at $\lambda_{21}$, is added to the model equation,
the predictive ability is improved considerably. Thus by including $A_{21}$, the
least-squares model is

$$ Tr = -0.00067 + 43.68 A_{14} - 39.78 A_{21} \qquad (41) $$

with the test samples,

Actual:         7        14        27       mg kg⁻¹
Predicted:      7.03     14.04     26.99    mg kg⁻¹

This model as given by Equation (41) could be usefully employed for the
quantitative determination of tryptophan in the presence of tyrosine.

Of course, the reason for the improvement in the calibration model when the
second term is included is that $A_{21}$ serves to compensate for the absorbance due
to the tyrosine since $\lambda_{21}$ is in the spectral region of a tyrosine absorption
band with little interference from tryptophan, Figure 6. In general, the selection of
variables for multivariate regression analysis may not be so obvious.
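
The univariate and bivariate calibrations can be compared with a short
least-squares sketch. The example below is illustrative: the helper `fit` is a
name introduced here, the standards are those of Table 10, and the absorbance
of test solution X3 at the first wavelength is taken as 0.8100, the value
consistent with the predictions quoted for that sample.

```python
import numpy as np

# Standards from Table 10: concentration (mg/kg) and absorbance at two wavelengths.
conc = np.array([0, 5, 10, 15, 20, 25, 29], dtype=float)
a14 = np.array([0.0356, 0.3068, 0.3980, 0.3860, 0.6020, 0.6680, 0.8470])
a21 = np.array([0.0390, 0.2110, 0.1860, 0.0450, 0.1580, 0.1070, 0.2010])

# Test solutions X1, X2, X3 (true concentrations 7, 14, 27 mg/kg).
# Assumption: the A14 reading for X3 is 0.8100.
test_a14 = np.array([0.3440, 0.3670, 0.8100])
test_a21 = np.array([0.2010, 0.0500, 0.2110])
true_conc = np.array([7.0, 14.0, 27.0])

def fit(columns, y):
    """Least-squares coefficients [intercept, b1, b2, ...] for the given columns."""
    X = np.column_stack([np.ones(len(y)), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

uni = fit([a14], conc)          # Tr = a + b * A14
bi = fit([a14, a21], conc)      # Tr = a + b1 * A14 + b2 * A21

pred_uni = uni[0] + uni[1] * test_a14
pred_bi = bi[0] + bi[1] * test_a14 + bi[2] * test_a21

print("univariate coefficients:", np.round(uni, 2))
print("bivariate coefficients: ", np.round(bi, 2))
print("true:      ", true_conc)
print("univariate:", np.round(pred_uni, 2))
print("bivariate: ", np.round(pred_bi, 2))
```

The second wavelength carries almost no tryptophan signal of its own; its value
to the model lies in estimating, and so removing, the tyrosine contribution at
the first wavelength.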


Selection of Variables for Regression 


In the discussions above and in the examples previously described, it has been 
assumed that the variables to be included in the multivariate regression equa- 
tion were known in advance. Either some theoretical considerations determine 
the variables or, as in many spectroscopic examples, visual inspection of the 
data provides an intuitive feel for the greater relevance of some variables 
compared with others. In such cases, serious problems associated with the 
selection of appropriate variables may not arise. The situation is not so simple 
where no sound theory exists and variable selection is not obvious. Then some 
formal procedure for choosing which variables to include in a regression 
analysis is important and the task may be far from trivial. 

The problems and procedures for selecting variables for regression analysis
can be illustrated by considering the use of near-IR spectrometry for quantitative
analysis. Despite its widespread use in manufacturing and process industries,
the underlying theory regarding specific spectral transitions associated
with the absorption of radiation in the near-IR region has been little studied.
Unlike the fundamental transitions observed in the mid-IR region, giving rise
to discrete absorption bands, near-IR spectra are often characterized by
overtones and combination bands and the observed spectra are typically
complex and, to a large extent, lacking in readily identifiable features. It rarely
arises, therefore, that absorption at a specific wavelength can be attributed to a
single chemical entity or species. For quantitative analysis a range of measurements,
each at different wavelengths, must be recorded in order to attempt to
correct for spectral interference. In the limit, of course, the whole spectrum can
be employed as a list or vector of variates. The dependent variable, y, can then
be represented by a linear model of the form

$$ y = a + \sum b_i x_i + \varepsilon \qquad (42) $$

where y is the concentration of some analyte, $x_i$ is the measured response
(absorbance or reflectance) at the ith wavelength, and a and $b_i$ are the
coefficients or weights associated with each variate. For a complete spectrum,
extending from say 1200 to 2000 nm, i may take on values of several hundreds
and the solution of the possible hundreds of simultaneous equations necessary
to determine the full range of the coefficients in order to predict y from the
analytical data is computationally demanding. In preparing such a multivariate
calibration model, therefore, it would be reasonable to address two key points.
Firstly, which of the variates contribute most significantly to the prediction
model and which variates can be left out without reducing the effectiveness of
the model? If most of the calibration information can be demonstrated to reside
in only a few measurements then the computational effort is reduced considerably.
Secondly, is there any penalty, other than increased data processing time,
in having more variates in the set of equations than strictly necessary? After all,
with the data processing power now available with even the most modest
personal computer, why not include all measurements in the calibration?


Table 11 UV absorbance data recorded at seven wavelengths, $A_9 \ldots A_{27}$, of 14
solutions containing known amounts of tryptophan. The spectra of two
test solutions containing 11 and 25 mg kg⁻¹ tryptophan respectively
are also included

        Tr          A9      A12     A15     A18     A21     A24     A27
        (mg kg⁻¹)

         2          0.632   0.292   0.318   0.436   0.296   0.069   0.079
         4          0.558   0.275   0.418   0.468   0.258   0.116   0.072
         6          0.565   0.300   0.392   0.501   0.279   0.040   0.052
         8          0.549   0.332   0.502   0.509   0.224   0.055   0.018
        10          0.570   0.351   0.449   0.480   0.222   0.056   0.025
        12          0.273   0.309   0.427   0.324   0.156   0.056   0.080
        14          0.276   0.378   0.420   0.265   0.063   0.019   0.006
        16          0.469   0.444   0.550   0.456   0.181   0.063   0.053
        18          0.504   0.551   0.585   0.524   0.172   0.10    0.078
        20          0.554   0.566   0.654   0.513   0.168   0.070   0.083
        22          0.501   0.553   0.667   0.521   0.143   0.103   0.035
        24          0.464   0.636   0.691   0.525   0.122   0.077   0.100
        26          0.743   0.743   0.901   0.785   0.313   0.088   0.072
        28          0.754   0.793   0.939   0.773   0.261   0.024   0.095

Mean    15          0.529   0.466   0.565   0.506   0.204   0.068   0.061
s        8.367      0.139   0.174   0.188   0.139   0.072   0.030   0.030

X1      11          0.254   0.324   0.337   0.337   0.110   0.035   0.034
X2      25          0.497   0.656   0.771   0.513   0.150   0.053   0.083


Table 12 Correlation matrix between tryptophan concentration and absorbance
at seven wavelengths for the 14 standard solutions from Table 11

        Tr       A9      A12     A15     A18     A21     A24     A27

Tr      1
A9      0.225    1
A12     0.955    0.474   1
A15     0.919    0.554   0.969   1
A18     0.589    0.877   0.765   0.832   1
A21    −0.228    0.830   0.020   0.135   0.615   1
A24     0.015    0.250   0.083   0.108   0.220   0.207   1
A27     0.393    0.376   0.510   0.474   0.456   0.271   0.298   1

As an easily managed example of multivariate data analysis we shall consider 
the spectral data presented in Table 11. These data represent the recorded 
absorbance of 14 standard solutions containing known amounts of tryptophan, 
measured at seven wavelengths, in the UV region under noisy conditions and in 
the presence of other absorbing species. Two test spectra, X1 and X2, are also 
included. 

Some of these spectra are illustrated in Figure 8 and the variation in
absorbance at each wavelength as a function of tryptophan concentration is
shown in Figure 9. No single wavelength measure exhibits an obvious linear
trend with analyte concentration and a univariate calibration is unlikely to
prove successful. The matrix of correlation coefficients between the variables,
dependent and independent, is given in Table 12. The independent variable
most highly correlated with tryptophan concentration is the measured
absorbance at $\lambda_{12}$, $A_{12}$, i.e.

$$ Tr = a + b A_{12} \qquad (43) $$

and by least-squares modelling,

$$ Tr = -6.31 + 45.74 A_{12} \qquad (44) $$

and for our two test samples, of concentrations 11 and 25 p.p.m.,
X1 = 8.51 mg kg⁻¹ and X2 = 23.69 mg kg⁻¹.

The regression model vs. actual results scatter plot is shown in Figure 10 and
the plot of residuals ($y_i - \hat{y}_i$) in Figure 11. Despite the apparent high
correlation between tryptophan concentration and $A_{12}$, the univariate model is a
poor predictor, particularly at low concentrations.

Figure 8 Some of the spectra from Table 11

Figure 9 Absorbance vs. tryptophan concentration at the seven wavelengths monitored

Figure 10 The predicted tryptophan concentration from the univariate regression model,
using $A_{12}$, vs. the true, known concentration. Prediction lines for test samples
X1 and X2 are illustrated also

Figure 11 Residuals as a function of concentration for the univariate regression model,
using $A_{12}$ from Table 11


In order to improve the performance of the calibration model other information
from the spectral data could be included. The absorbance at $\lambda_{21}$, for
example, is negatively correlated with tryptophan concentration and may serve
to compensate for the interfering species present. Including $A_{21}$ gives the
bivariate model defined by

$$ Tr = a + b_1 A_{12} + b_2 A_{21} \qquad (45) $$

By ordinary least squares regression, Equation (45) can be solved to provide

$$ Tr = -0.51 + 45.75 A_{12} - 28.43 A_{21} \qquad (46) $$

with a coefficient of determination, $r^2$, of 0.970. The model vs. actual data and
the residuals plot are shown in Figures 12 and 13. X1 and X2 are evaluated as
11.19 and 25.24 mg kg⁻¹ respectively.

Figure 12 True and predicted concentrations using the bivariate model with $A_{12}$ and $A_{21}$

Figure 13 Residuals as a function of concentration for the bivariate regression model,
using $A_{12}$ and $A_{21}$ from Table 11

Although the bivariate model performs considerably better than the univariate
model, as evidenced by the smaller residuals, the calibration might be
improved further by including more spectral data. The question arises as to
which data to include. In the limit of course, all data will be used and the model
takes the form

$$ Tr = a + b_1 A_9 + b_2 A_{12} + b_3 A_{15} + \ldots + b_7 A_{27} \qquad (47) $$

Determining the value of each coefficient by least squares requires the solution
of eight simultaneous equations. In matrix notation the normal equations can be
expressed as,

$$ \mathbf{Tr} = \mathbf{A} \cdot \mathbf{b} \qquad (48) $$

where

$$ \mathbf{A} = \begin{bmatrix} \sum 1 & \sum A_9 & \cdots & \sum A_{27} \\ \sum A_9 & \sum A_9^2 & \cdots & \sum A_9 A_{27} \\ \sum A_{12} & \sum A_{12} A_9 & \cdots & \sum A_{12} A_{27} \\ \vdots & \vdots & & \vdots \\ \sum A_{27} & \sum A_{27} A_9 & \cdots & \sum A_{27}^2 \end{bmatrix} $$

$$ \mathbf{b}^T = [a \;\; b_1 \;\; b_2 \;\; b_3 \;\; b_4 \;\; b_5 \;\; b_6 \;\; b_7] $$

$$ \mathbf{Tr}^T = [\sum Tr \;\; \sum A_9 Tr \;\; \sum A_{12} Tr \;\; \sum A_{15} Tr \;\; \sum A_{18} Tr \;\; \sum A_{21} Tr \;\; \sum A_{24} Tr \;\; \sum A_{27} Tr] \qquad (49) $$

Calculating the individual elements of matrix A and computing its inverse in
order to solve Equation (48) for b can give rise to computational errors, and it is
common practice to modify the calculation to achieve greater accuracy.⁵
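
The construction of the normal equations can be sketched for a small case. The
example below is illustrative and uses only two of the seven absorbance columns
of Table 11 (those at the second and fifth wavelengths, written here as `a12`
and `a21`), so that the matrix of sums is 3 × 3 rather than 8 × 8; the same
construction extends to all seven columns. The variable names are assumptions
made for this example.

```python
import numpy as np

# Tryptophan concentrations of the 14 standards (2, 4, ..., 28 mg/kg).
tr = np.arange(2, 30, 2, dtype=float)

# Two absorbance columns transcribed from Table 11.
a12 = np.array([0.292, 0.275, 0.300, 0.332, 0.351, 0.309, 0.378,
                0.444, 0.551, 0.566, 0.553, 0.636, 0.743, 0.793])
a21 = np.array([0.296, 0.258, 0.279, 0.224, 0.222, 0.156, 0.063,
                0.181, 0.172, 0.168, 0.143, 0.122, 0.313, 0.261])

# Design matrix: a column of ones for the intercept, then each variate.
X = np.column_stack([np.ones_like(tr), a12, a21])

# Normal equations: (X^T X) b = X^T Tr.
A = X.T @ X          # sums and cross-products of the variates
rhs = X.T @ tr       # sums of each variate multiplied by Tr
b = np.linalg.solve(A, rhs)

# A numerically safer route that avoids forming X^T X explicitly.
b_lstsq = np.linalg.lstsq(X, tr, rcond=None)[0]

print("normal equations:", np.round(b, 2))
print("lstsq:           ", np.round(b_lstsq, 2))
```

With two well-behaved variates both routes agree closely. As the number of
correlated variates grows, the matrix of sums becomes increasingly
ill-conditioned, which is the computational difficulty referred to above.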

If the original data matrix is converted into the correlation matrix, then each 
variable is expressed in the standard normal form with zero mean and unit 
standard deviation. The intercept coefficient using these standardized variables 
will now be zero and the required value can be calculated later. The regression 
equation in matrix form is then 


R.B=r (50) 


where R is the matrix of correlation coefficients between the independent 
variables, r is the vector of correlations between the dependent variable and 
each independent variable, and B is the vector of standard regression coeffi- 
cients we wish to determine. 

The individual elements of R and r are available from Table 12 and we may
calculate B by rearranging Equation (50):

$$ \mathbf{B} = \mathbf{R}^{-1} \cdot \mathbf{r} \qquad (51) $$

and,

$$ \mathbf{B}^T = [-0.28 \;\; 0.709 \;\; 0.419 \;\; -0.006 \;\; -0.052 \;\; 0.006 \;\; -0.044] $$

(displayed as a row vector for convenience only).

To be used in a predictive equation these coefficients must be 'unstandardized',
and, from Equation (30),

$$ b_i = B_i \sigma_y / \sigma_i $$

Hence

$$ \mathbf{b}^T = [-16.83 \;\; 34.08 \;\; 18.67 \;\; -0.36 \;\; -6.08 \;\; 1.71 \;\; -12.38] \qquad (52) $$

The constant intercept term is obtained from Equation (47),

$$ a = \bar{y} - b_1 \bar{x}_1 - b_2 \bar{x}_2 - b_3 \bar{x}_3 - b_4 \bar{x}_4 - b_5 \bar{x}_5 - b_6 \bar{x}_6 - b_7 \bar{x}_7 $$
$$ = -0.465 \qquad (53) $$


5 J.C. Davis, 'Statistics and Data Analysis in Geology', J. Wiley and Sons, New York, USA, 1973.

Predicted regression results compared with known tryptophan concentration
values are shown graphically in Figure 14, and Figure 15 shows the residuals.
The calculated concentrations for X1 and X2 are 11.44 and 25.89 p.p.m.
respectively. Although the predicted concentrations for our two test samples
are inferior to the results obtained with the bivariate model, the full, seven-
factor model fits the data better, as can be observed from Figure 14 and the
smaller residuals in Figure 15. Unfortunately, including all seven terms in the
model has also added random noise to the system; A24 and A27 are measured at
long wavelengths where negligible absorption would be expected from any
component in the samples. In addition, where several hundred wavelengths
may be monitored with a high degree of colinearity between the data, it is
necessary and worthwhile to use an appropriate subset of the independent
variables. For predictive purposes it is often possible to do at least as well with a
carefully chosen subset as with the total set of independent variables. As the
Figure 14 True and predicted concentrations using all variables from Table 11

Figure 15 Residuals as a function of concentration for the full regression model using all
variables from Table 11


number of independent variables increases the number of subsets of all possible 
combinations of variables increases dramatically and a formal procedure must 
be implemented to select the most appropriate variables to include in the 
regression model. A very direct procedure for testing the significance of each
variable involves fitting all possible subsets of the variates in the equation and
evaluating the best response. However, this is rarely possible. With p variables
the total number of equations to be examined is 2^p, if we include the equation
containing all variates and that containing none. Even with only eight vari-
ables, the number of equations is 256, and to examine a complete spectrum
containing many hundreds of measures the technique is neither feasible nor
practical.
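The growth in the number of candidate equations can be checked with a few lines of Python. This is only an illustrative sketch: the helper name all_subsets is hypothetical, and the variable labels are simply the seven absorbance variables discussed in the text.

```python
from itertools import combinations

def all_subsets(variables):
    """Yield every subset of the variable names, including the empty one."""
    for size in range(len(variables) + 1):
        for subset in combinations(variables, size):
            yield subset

# Labels of the seven absorbance variables used in the worked example
names = ["A9", "A12", "A15", "A18", "A21", "A24", "A27"]

# Count the candidate equations: one per subset, i.e. 2**p in total
n_models = sum(1 for _ in all_subsets(names))
print(n_models, 2 ** len(names))

# With eight variables the count doubles again
print(2 ** 8)
```

Each extra variable doubles the count, which is why exhaustive search quickly becomes impractical for full spectra.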

In some cases there may exist a strong practical or theoretical justification for 
including certain variables in the regression equation. In general, however, 
there is no preconceived assessment of the relative importance of some or all of 
the independent variables. One method, mentioned briefly previously, is to 
examine the relative magnitudes of the standard regression coefficients. For our
experimental data, from B^T [Equation (51)], this would indicate that A9, A12,
and A15 are the most important. More sophisticated strategies are employed in
computer software packages. For cases where there are a large number of 
significant variates, three basic procedures are in common use. These methods 
are referred to as the forward selection procedure, the backward selection 
procedure, and the stepwise method. 

The forward selection technique starts with an empty equation, possibly 
containing a constant term only, with no independent variables. As the pro- 
cedure progresses, variates are added to the test equation one at a time. The first 
variable included is that which has the highest correlation with the dependent 
variable y. The second variable added to the equation is the one with the highest 
correlation with y, after y has been adjusted for the effect of the first variable,
i.e. the variable with the highest correlation with the residuals from the first
step. This method is equivalent to selecting the second variable so as to
maximize the partial correlation with y after removing the linear effect of the
first chosen variable. The procedure continues in this manner until no further
variate has a significant effect on the fitted equation.

From Table 12, the absorbance at A12 exhibits the highest correlation with
tryptophan concentration and this is the first variable added to the equation,
Equation (43). To choose the second variable, we could select A15 as this has the
second highest absolute correlation with Tr, but this may not be the best choice.
Some other variable combined with A12 may give a higher multiple correlation
than A15 and A12.

Multiple correlation represents the simple correlation between known values
of the dependent variable and the equivalent values derived from the
regression equation. Partial correlation, on the other hand, is the simple
correlation between the residuals from the regression lines or planes on the
variable whose effects are removed. For our UV absorbance data we wish to
remove the linear effect of A12 regressed on Tr so that we can subsequently
assess the correlations of the other variables.
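The idea of partial correlation as a correlation between residuals can be sketched in plain Python. The function names and the small data set below are hypothetical and are not taken from Table 11; the sketch only shows the mechanics of removing the linear effect of one variable from two others and then correlating what is left.

```python
from math import sqrt

def mean(v):
    return sum(v) / len(v)

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    mx, my = mean(x), mean(y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    a, b = fit_line(x, y)
    return [yi - (a + b * xi) for xi, yi in zip(x, y)]

def correlation(u, v):
    mu, mv = mean(u), mean(v)
    num = sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))
    den = sqrt(sum((ui - mu) ** 2 for ui in u) * sum((vi - mv) ** 2 for vi in v))
    return num / den

def partial_correlation(y, x_new, x_in):
    """Correlation of y with x_new after removing the linear effect of x_in from both."""
    return correlation(residuals(x_in, y), residuals(x_in, x_new))

# Made-up illustrative data: y is an exact linear function of x1 and x2,
# so once x1 is removed, x2 accounts for everything that is left in y.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [a + 0.5 * b for a, b in zip(x1, x2)]

r_simple = correlation(y, x2)
r_partial = partial_correlation(y, x2, x1)
print(r_simple, r_partial)
```

In forward selection the candidate with the largest absolute value of this quantity is the next variable to enter.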

From Equation (44), for the univariate model using A12,


Tr = -6.31 + 45.74 A12

and regressing each of the remaining independent variables on A12 gives


Â9 = 0.32 + 0.40 A12
Â15 = 0.06 + 1.07 A12
Â18 = 0.21 + 0.60 A12
Â21 = 0.19 + 0.08 A12
Â24 = 0.06 + 0.01 A12
Â27 = 0.02 + 0.08 A12 (54)


The matrix of residuals (Tr - T̂r, A9 - Â9, A15 - Â15, etc.) is given in Table
13, and the corresponding correlation matrix between these residuals in Table
14. From Table 14 the variable having the largest absolute correlation with the Tr
residuals is A9. Therefore we select this as the second variable to be added to the
regression model.

Hence, at step 2,


Tr = -0.60 + 52.10 A12 - 16.58 A9 (55)


Forward regression proceeds to step 3 using the same technique. The vari-
ables A9 and A12 are regressed on to each of the variables not in the equation


6 A.A. Afifi and V. Clark, ‘Computer-Aided Multivariate Analysis’, Lifetime Learning, California, 
USA, 1984. 
Table 13 Matrix of residuals for each variable after removing the linear model
using A12

Tr - T̂r   A9 - Â9   A15 - Â15   A18 - Â18   A21 - Â21   A24 - Â24   A27 - Â27
-5.330    0.205     -0.054      0.051       0.083       0.005       0.036
-2.557    0.138     0.067       0.093       0.046       0.052       0.030
-1.694    0.135     0.011       0.111       0.065       -0.024      0.008
-1.149    0.106     0.087       0.100       0.007       -0.010      -0.029
-0.013    0.120     0.013       0.059       0.004       -0.009      -0.023
3.897     -0.161    0.036       -0.071      -0.059      -0.008      0.035
2.759     -0.185    -0.044      -0.172      -0.157      -0.046      -0.044
1.757     -0.019    0.015       -0.020      -0.044      -0.003      -0.002
-1.109    -0.026    -0.065      -0.017      -0.062      0.042       0.014
0.208     0.018     -0.012      -0.037      -0.067      0.002       0.018
2.800     -0.030    0.015       -0.021      -0.091      0.035       -0.029
1.025     -0.100    -0.049      -0.067      -0.119      0.008       0.029
-1.842    0.136     0.046       0.129       0.064       0.018       -0.007
-2.116    0.127     0.031       0.087       0.008       -0.047      0.012


Table 14 Matrix of correlations between the residuals from Table 13

            Tr - T̂r   A9 - Â9   A15 - Â15   A18 - Â18   A21 - Â21   A24 - Â24   A27 - Â27
Tr - T̂r     1
A9 - Â9     -0.88     1
A15 - Â15   0.08      0.34      1
A18 - Â18   -0.73     0.91      0.57        1
A21 - Â21   -0.81     0.92      0.42        0.91        1
A24 - Â24   -0.15     0.12      0.01        0.16        0.11        1
A27 - Â27   -0.35     0.15      -0.18       0.11        0.29        0.29        1


and the unused variable with the highest partial correlation coefficient is 
selected as the next to use. If we continue in this way then all variables will 
eventually be added and no effective subset will have been generated, so a 
stopping rule is employed. The most commonly used stopping rule in commer- 
cial programs is based on the F-test of the hypothesis that the partial corre- 
lation coefficient of the variable to be entered in the equation is equal to zero. 
No more variables are added to the equation when the F-value is less than some
specified cut-off value, referred to as the minimum F-to-enter value.
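The forward selection loop with an F-to-enter stopping rule can be sketched as follows. This is a minimal illustration using NumPy and synthetic data, not the routine used by any particular commercial package; the function names rss and forward_selection are hypothetical. The F-value for a candidate is computed from the reduction in the residual sum of squares it produces, which is equivalent to testing whether its partial correlation is zero.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for y fitted on a constant plus the columns in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def forward_selection(X, y, f_to_enter=4.0):
    """Add variables one at a time until no candidate reaches f_to_enter."""
    n, p = X.shape
    selected = []
    current_rss = rss(X, y, selected)
    while len(selected) < p:
        # Residual degrees of freedom once one more variable has been added
        df = n - len(selected) - 2
        if df <= 0:
            break
        candidates = []
        for j in range(p):
            if j not in selected:
                trial_rss = rss(X, y, selected + [j])
                # Assumes a noisy fit, so trial_rss is never exactly zero
                f_value = (current_rss - trial_rss) / (trial_rss / df)
                candidates.append((f_value, j, trial_rss))
        best_f, best_j, best_rss = max(candidates)
        if best_f < f_to_enter:
            break
        selected.append(best_j)
        current_rss = best_rss
    return selected

# Synthetic data: only columns 1 and 3 carry information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=40)

chosen = forward_selection(X, y, f_to_enter=4.0)
print(chosen)
```

The two informative columns are entered first; whether any further column enters depends on the cut-off chosen and on chance correlations in the noise.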

A completed forward regression analysis of the UV absorbance data is
presented in Table 15. Using a cut-off F-value of 4.60 (F1,14 at the 95% confidence
limit), three variables are included in the final equation:


Tr = -0.77 + 33.92 A12 - 19.47 A9 + 18.05 A15 (56)


The predicted vs. actual data are illustrated in Figure 16 and the residuals 
Table 15 Forward regression analysis of the data from Table 11. After three
steps no remaining variable has an F-to-enter value exceeding the
declared minimum of 4.60, and the procedure stops

Dependent variable: Tr

Step 1: Variable entered: A12
Variables in equation             Constant   A12
Coefficient                       -6.31      45.74
Variables not in equation         A9      A15     A18     A21     A24     A27
Partial correlation coefficient   -0.88   -0.10   -0.74   -0.83   -0.22   -0.36
F-to-enter                        43.20   0.14    16.19   29.87   0.64    1.90

Step 2: Variable entered: A9
Variables in equation             Constant   A12     A9
Coefficient                       -0.60      52.10   -16.58
Variables not in equation         A15     A18     A21     A24     A27
Partial correlation coefficient   0.65    0.25    -0.10   -0.01   -0.43
F-to-enter                        8.84    0.87    0.12    0.03    2.66

Step 3: Variable entered: A15
Variables in equation             Constant   A12     A9       A15
Coefficient                       -0.77      33.92   -19.47   18.05
Variables not in equation         A18     A21     A24     A27
Partial correlation coefficient   -0.07   -0.30   -0.03   -0.40
F-to-enter                        0.06    1.10    0.01    2.41





Figure 16 True and predicted concentrations using three variables (A9, A12, and A15)
from Table 11

Figure 17 Residuals as a function of concentration for the three-variable regression model
from forward regression analysis


plotted in Figure 17. Calculated values for X1 and X2 are 11.30 and 25.72 mg
kg^-1 respectively.

An alternative method is backward elimination. This technique
starts with a full equation containing every measured variate and successively
deletes one variable at each step. The variables are dropped from the equation
on the basis of testing the significance of the regression coefficients, i.e. for each
variable, is the coefficient zero? The corresponding F-statistic is referred to as
the computed F-to-remove. The procedure is terminated when all variables
remaining in the model are considered significant.

Table 16 illustrates a worked example using the tryptophan data. Initially,
with all variables in the model, A18 has the smallest computed F-to-remove
value and this variable is eliminated from the model at the first
step. The procedure continues by computing a new regression equation with the
remaining six variables and again examining the calculated F-to-remove values
for the next candidate for elimination. This process continues until no variable
can be removed since all F-to-remove values are greater than some specified
maximum value. This is the stopping rule; F-to-remove = 4 was employed here.
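Backward elimination can be sketched in the same style. Again this is a hedged illustration with synthetic data and hypothetical function names (rss, backward_elimination); the F-to-remove for a variable is taken here as the increase in the residual sum of squares when that variable alone is dropped, scaled by the residual variance of the current model.

```python
import numpy as np

def rss(X, y, cols):
    """Residual sum of squares for y fitted on a constant plus the columns in cols."""
    A = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)

def backward_elimination(X, y, f_to_remove=4.0):
    """Start from the full model and drop the weakest variable at each step."""
    n, p = X.shape
    kept = list(range(p))
    while kept:
        full_rss = rss(X, y, kept)
        df = n - len(kept) - 1          # residual degrees of freedom of the current model
        f_values = {}
        for j in kept:
            reduced = [k for k in kept if k != j]
            f_values[j] = (rss(X, y, reduced) - full_rss) / (full_rss / df)
        weakest = min(f_values, key=f_values.get)
        if f_values[weakest] >= f_to_remove:
            break                        # every remaining variable is significant
        kept.remove(weakest)
    return kept

# Synthetic data: only columns 1 and 3 carry information about y
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 3.0 * X[:, 1] - 2.0 * X[:, 3] + 0.1 * rng.normal(size=40)

kept = backward_elimination(X, y, f_to_remove=4.0)
print(kept)
```

Each step of this sketch refits the model once per remaining variable, which reflects the heavier computation of backward elimination when many variables are present.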

It so happens in this example that the results of performing backward 
elimination regression are identical with those obtained from the forward 
regression analysis. This may not be the case in general. In its favour, forward 
regression generally involves a smaller amount of computation than backward 
elimination, particularly when many variables are involved in the analysis. 
However, should it occur that two or more variables combine together to be a 
good predictor compared with single variables, then backward elimination will 
often lead to a better equation. 

Finally, stepwise regression, a modified version of the forward selection 
technique, is often available with commercial programs. As with forward 
Table 16 Backward regression analysis of the data from Table 11. After four
steps, three variables remain in the regression equation; their F-to-
remove values exceed the declared maximum value of 4.0

Dependent variable: Tr

Step 0: All variables entered
Variables in equation   Constant   A9       A12     A15     A18     A21     A24    A27
Coefficient             -0.45      -16.83   34.08   18.67   -0.36   -6.08   1.71   -12.38
F-to-remove                        5.96     12.43   5.03    0.001   0.08    0.04   0.78

Step 1: Remove A18
Variables in equation   Constant   A9       A12     A15     A21     A24    A27
Coefficient             -0.62      -16.45   34.85   17.12   -4483   2.09   -13.25
F-to-remove                        6.92     16.54   6.32    0.17    0.04   1.01

Step 2: Remove A24
Variables in equation   Constant   A9       A12     A15     A21     A27
Coefficient             -0.53      -16.14   34.50   17.28   -5.32   -12.35
F-to-remove                        7.83     18.67   7.23    0.23    1.09

Step 3: Remove A21
Variables in equation   Constant   A9       A12     A15     A27
Coefficient             -0.62      -18.69   36.67   16.37   -14.88
F-to-remove                        73.03    33.29   7.4     2.12

Step 4: Remove A27
Variables in equation   Constant   A9       A12     A15
Coefficient             -0.77      -19.47   33.92   18.05
F-to-remove                        77.08    25.89   8.84


selection, the procedure increases the number of variables in the equation at 
each step but at each stage the possibility of deleting a previously included 
variable is considered. Thus, a variable entered at an earlier stage of selection 
may be deleted at subsequent, later stages. 

It is important to bear in mind that none of these subset multiple linear 
regression techniques are guaranteed or even expected to produce the best 
possible regression equation. The user of commercial software products is 
encouraged to experiment. 


Principal Components Regression 


It is often the case with multiple regression analysis involving large numbers of 
independent variables that there exists extensive colinearity or correlation 
between these variables. Colinearity adds redundancy to the regression model 
since more variables may be included in the model than is necessary for 
adequate predictive performance. Of the methods available to the analytical 
chemist for regression analysis with protection against the problems induced by 
correlation between variables, principal components regression, PCR, is the
most commonly employed.

Having discussed in the previous section the problems associated with vari- 
able selection, we may now summarize our findings. The following rules-of- 
thumb provide a useful guide: 


(a) Select the smallest number of variables possible. Including unnecessary 
variables in our model will introduce bias in the estimation of the 
regression coefficients and reduce the precision of the predicted values. 

(b) Use the maximum information contained in the independent variables. 
Although some of the variables are likely to be redundant, potentially 
important variables should not be discounted solely in order to reduce 
the size of the problem. 

(c) Choose independent variables that are not highly correlated with each 
other. Colinearity can cause numerical instability in estimating regres- 
sion coefficients. 


Although subset selection along with multiple linear regression provides a 
means of reducing the number of variables studied, the method does not 
address the problems associated with colinearity. To achieve this, the regression 
coefficients should be orthogonal. The technique of generating orthogonal 
linear combinations of variables in order to extract maximum information from 
a data set was encountered previously in eigen analysis and the calculation of 
principal components. The ideas derived and developed in Chapter 3 can be
applied here to regression analysis.

As an example consider the variables A12, A15, and A18 from the UV
absorbance data of Table 11. These three variables are highly correlated
with each other, as can be seen from Table 12. This intercorrelation can also
be observed in the scatter plot of these variables, Figure 18. By principal


Figure 18 Scatter plot of absorbance data at three wavelengths, A12, A15, and A18, from
Table 11. The high degree of colinearity, or correlation, between these data is
evidenced by their lying on a plane and not being randomly distributed in the
pattern space

Figure 19 The first principal component, PC1, from A12, A15, and A18 vs. tryptophan
concentration


components analysis two new variables can be defined containing over 99% of
the original variance of the three original variables. The first principal com-
ponent alone accounts for over 90% of the total variance and a plot of Tr
against PC1 is shown in Figure 19.

The use and application of principal components in regression analysis has
been extensively reported in the chemometrics literature.⁷⁻¹⁰ We can calculate
the principal components from our data set, so providing us with a set of new,
orthogonal variables. Each of these principal components will be a linear
combination of, and contain information from, all the original variables. By
selecting an appropriate subset of principal components, the regression model
is reduced whilst retaining the relevant information from the original data. The
PCR technique described here follows the methodology described by Martens
and Naes¹¹ and is applied to the data from Table 11. The original data are
preprocessed by mean-centring. The variance-covariance dispersion matrix is
then computed and from this square, symmetric matrix we calculate the norma-
lized eigenvalues and eigenvectors. From each eigenvector, the principal com-
ponent scores are determined, and by conventional regression analysis the
calibration model is developed. The stepwise procedure is illustrated in Table
17 and we will now follow the steps involved.


7 E.V. Thomas and D.M. Haaland, Anal. Chem., 1990, 62, 1091.

8 R.G. Brereton, ‘Chemometrics’, Ellis Horwood, Chichester, UK, 1990.

9 J.H. Kalivas, in ‘Practical Guide to Chemometrics’, ed. S.J. Haswell, Marcel Dekker, New York,
USA, 1992.

10 P.S. Wilson, in ‘Computer Methods in UV, Visible and IR Spectroscopy’, ed. W.O. George and
H.A. Willis, Royal Society of Chemistry, Cambridge, UK, 1990.

11 H. Martens and T. Naes, ‘Multivariate Calibration’, J. Wiley and Sons, Chichester, UK, 1991.
The general linear regression equation is given by

y = a + X.b (57)


where b is the vector of estimates of the regression coefficients to be determined.
This vector b can be written as the product of the eigenvectors and the
y-loadings,


b = P.q (58)


where P is the matrix of eigenvectors, or loadings. Each column of P is an
eigenvector for each factor included in the regression model. Elements of the P
matrix are p_jk (j = 1 ... m, the number of original variables, and k = 1 ... K,
the number of factors or principal components used in the model). The vector q
represents the y-loadings, which can be determined by regression of y on T, the
matrix of scores, t_k, for each principal component. Martens and Naes derive q
from


q = D.T^T.y (59)


where D is a diagonal matrix, with each diagonal element equal to 1/τ_k (τ_k = the
eigenvalue of factor k).
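Equations (57) to (59) can be sketched in NumPy as follows. This is an illustrative outline rather than the authors' program: the function name pcr_fit and the synthetic data are hypothetical, and it is assumed that the dispersion matrix is the unscaled cross-product X0^T.X0 of the mean-centred data, so that the sum of squares of each score vector equals the corresponding eigenvalue, as the use of D in Equation (59) requires.

```python
import numpy as np

def pcr_fit(X, y, n_factors):
    """Principal components regression; returns the intercept a and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    X0, y0 = X - x_mean, y - y_mean              # mean-centred data

    # Eigen-decomposition of the symmetric cross-product matrix (assumed unscaled)
    eigvals, eigvecs = np.linalg.eigh(X0.T @ X0)
    order = np.argsort(eigvals)[::-1][:n_factors]  # largest eigenvalues first
    tau = eigvals[order]
    P = eigvecs[:, order]                        # loadings, one column per factor

    T = X0 @ P                                   # scores
    q = (T.T @ y0) / tau                         # Equation (59): q = D.T^T.y
    b = P @ q                                    # Equation (58): b = P.q
    a = y_mean - x_mean @ b                      # intercept from the variable means
    return a, b

# Synthetic data with four independent variables
rng = np.random.default_rng(2)
X = rng.normal(size=(30, 4))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 0.0]) + 0.1 * rng.normal(size=30)

a2, b2 = pcr_fit(X, y, n_factors=2)   # reduced model
a4, b4 = pcr_fit(X, y, n_factors=4)   # all factors retained
print(a2, b2)
print(a4, b4)
```

When every factor is retained the coefficients coincide with those of ordinary multiple linear regression; the benefit of PCR comes from discarding the factors with small eigenvalues.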

Working through our example in Table 17 will serve to illustrate the tech-
nique in operation. The mean-centred transformed data (x_i - x̄ and y_i - ȳ) are
presented as the matrix X0 and vector y0. The eigenvalues of X0 are provided in
Table 18 and it is evident that the data can be adequately described by two or
three principal components. The eigenvector corresponding to the first prin-
cipal component is p1,


p1^T = [-0.334 -0.555 -0.614 -0.442 -0.076 -0.01 -0.045] (60)


and the greatest weights are given to A9, A12, A15, and A18.
The vector of scores, t1, is obtained from,


t1 = X0.p1


With only one principal component in the model, T = t1 and D = 1/τ1 = 0.879;
thus,


q1 = 0.879 t1^T.y0 = -23.405 (61)


and the estimated regression coefficients for the one-factor model, from Equa-
tion (58), are


b = p1.q1 (62)
Table 18 Eigenvalues of the mean-centred original data (Table 11). Over 97%
of the original variance can be accounted for in the first two principal
components

Eigenvalue   Percentage contribution   Cumulative percentage contribution
1.138        79.08                     79.08
0.263        18.28                     97.36
0.017        1.18                      98.54
0.014        0.97                      99.51
0.003        0.21                      99.72
0.002        0.14                      99.86
0.002        0.14                      100.00

with a constant a given by

a = ȳ - x̄^T.b (63)


Therefore,


Tr(single factor) = -8.99 + 7.82 A9 + 12.99 A12 + 14.37 A15
+ 10.34 A18 + 1.78 A21 + 0.24 A24 + 1.06 A27 (64)

The actual vs. predicted tryptophan concentration values employing the 
single factor are shown in Figure 20. 
The algorithm proceeds by subtracting the effect of the first factor from X0


Figure 20 True vs. predicted tryptophan concentration using only the first principal
component in the regression model

Figure 21 True vs. predicted tryptophan concentration using the first two principal
components in the regression model


and y0, yielding the residual matrix X1 and vector y1. The process is repeated
with the second scores determined from the second eigenvector and a two-
factor model developed. Figure 21 shows comparative results.


Tr(two factor) = -0.54 - 13.96 A9 + 25.40 A12 + 24.09 A15
+ 0.35 A18 - 13.29 A21 - 0.52 A24 + 0.83 A27 (65)


and if a further factor is included,


Tr(three factor) = -0.571 - 12.95 A9 + 26.86 A12 + 22.83 A15
- 0.49 A18 - 13.84 A21 - 0.20 A24 + 1.56 A27 (66)


with predicted vs. actual data shown in Figure 22.

For the one, two, and three factor models the sums of the squares of the y
residuals are 337, 11.6, and 11.6 respectively and the predicted concentrations
for the test samples X1 and X2 are,


                X1 (11 mg kg^-1)   X2 (25 mg kg^-1)
One factor      5.78               20.18
Two factors     10.93              25.98
Three factors   10.90              26.00

Thus, as anticipated from visual examination of the eigenvalues, two factors 
are sufficient to describe the calibration and the regression model. 

In employing principal components as our regression factors we have suc- 
ceeded in fully utilizing all the measured variables and developed new, un- 
correlated variables. In selecting which eigenvectors to use, the first employed 
Figure 22 True vs. predicted tryptophan concentration using the first three principal
components in the regression model


is that corresponding to the largest eigenvalue, the second that corresponding 
to the next largest eigenvalue, and so on. This strategy assumes that the major 
eigenvectors correspond to phenomena in the X data matrix of relevance in 
modelling the dependent variable y. Although this is generally accepted as 
being the case for most analytical applications, another data compression 
method can be employed if variables having high variance but little relevance 
to y are thought to be present. This next method is partial least squares 
regression. 


Partial Least Squares Regression 


The calibration model referred to as partial least squares regression (PLSR) is a
relatively modern technique, developed and popularized in analytical science
by Wold.¹²,¹³ The method differs from PCR by including the dependent vari-
able in the data compression and decomposition operations, i.e. both y and x
data are actively used in the data analysis. This action serves to minimize the
potential effects of x variables having large variances but which are irrelevant
to the calibration model. The simultaneous use of X and y information makes
the method more complex than PCR as two loading vectors are required to
provide orthogonality of the factors.

The method illustrated here employs the orthogonalized PLSR algorithm
developed by Wold and extensively discussed by Martens and Naes.¹¹

As with PCR, the dependent and independent variables are mean centred to 


12 H. Wold, in ‘Perspectives in Probability and Statistics’, ed. J. Gani, Academic Press, London,
UK, 1975.

13 H. Wold, in ‘Encyclopaedia of Statistical Sciences’, ed. N.L. Johnson and S. Kotz, J. Wiley and
Sons, New York, USA, 1984.
give data matrix X0 and vector y0. Then for each factor, k = 1 ... K, to be
included in the regression model, the following steps are performed.


(a) The loading weight vector w_k is calculated by maximizing the covariance
between the linear combination of X_(k-1) and y_(k-1), given that w_k^T.w_k = 1.

(b) The factor scores, t_k, are estimated by projecting X_(k-1) on w_k.

(c) The loading vector p_k is determined by regressing X_(k-1) on t_k, and
similarly q_k by regressing y_(k-1) on t_k.

(d) From (X_(k-1) - t_k.p_k^T) and (y_(k-1) - t_k.q_k^T) new matrices X_k and y_k are
formed.

The optimum number of factors to include in the model is found by observa- 
tion and usual validation statistics. 

For our tryptophan UV absorbance data these steps provide the results 
shown in Table 19. 

The loading weight vector, w1, is calculated from


w1 = c.X0^T.y0 (67)

where the scaling factor c is given by


c = (y0^T.X0.X0^T.y0)^-0.5
= 0.036 (68)


The factor scores and loadings are estimated by

t1 = X0.w1

p1 = (X0^T.t1)/(t1^T.t1)

q1 = (y0^T.t1)/(t1^T.t1) (69)

The matrix and vector of residuals are finally computed,

X1 = X0 - t1.p1^T (70)

and

y1 = y0 - t1.q1^T (71)


The regression coefficients can be calculated by

b = W.(P^T.W)^-1.q (72)

and

a = ȳ - x̄^T.b (73)

where W is the matrix of loading weights, each column is a weight vector, and P
the matrix of loadings.

Figure 23 True vs. predicted tryptophan concentration using a one-factor partial least
squares regression model
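Steps (a) to (d) and Equations (67) to (73) can be collected into a short NumPy sketch for a single dependent variable. The function name pls1_fit and the synthetic data are hypothetical, and the code is an outline of the orthogonalized algorithm as described in the text rather than a reproduction of any published program.

```python
import numpy as np

def pls1_fit(X, y, n_factors):
    """Orthogonalized PLS regression for one y variable; returns intercept a and coefficients b."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean       # X0 and y0
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = Xk.T @ yk
        w = w / np.sqrt(w @ w)            # Equations (67), (68): unit-length loading weights
        t = Xk @ w                        # factor scores
        tt = t @ t
        p = Xk.T @ t / tt                 # x-loadings, Equation (69)
        qk = yk @ t / tt                  # y-loading, Equation (69)
        Xk = Xk - np.outer(t, p)          # Equation (70)
        yk = yk - t * qk                  # Equation (71)
        W.append(w)
        P.append(p)
        q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    b = W @ np.linalg.solve(P.T @ W, q)   # Equation (72)
    a = y_mean - x_mean @ b               # Equation (73)
    return a, b

# Synthetic data with three independent variables
rng = np.random.default_rng(3)
X = rng.normal(size=(25, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=25)

a1, b1 = pls1_fit(X, y, n_factors=1)   # one-factor model
a3, b3 = pls1_fit(X, y, n_factors=3)   # as many factors as variables
print(a1, b1)
print(a3, b3)
```

With as many factors as independent variables the model reproduces the ordinary least squares solution; in practice far fewer factors are retained, the number being chosen by validation.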
With the single factor in the model the regression equation is,


Tr(one factor) = -8.73 + 3.01 A9 + 17.13 A12 + 17.85 A15
+ 8.64 A18 + 1.83 A21 + 0.05 A24 + 1.03 A27 (74)

The predicted vs. actual concentration as a scatter plot is illustrated in Figure
23.
The procedure is repeated with a second factor included and,


Tr(two factors) = -0.53 - 13.90 A9 + 25.64 A12 + 23.89 A15
+ 0.30 A18 - 13.27 A21 - 0.62 A24 + 0.59 A27 (75)


and with three factors,


Tr(three factors) = -0.31 - 12.55 A9 + 31.03 A12 + 19.41 A15
- 0.74 A18 - 12.78 A21 - 2.82 A24 - 4.92 A27 (76)

The scatter plots are shown in Figure 24. The sums of squares of the residuals
for the one, two, and three-factor models are 201, 11.53, and 10.64 respectively
and the estimated tryptophan concentrations from the test solutions are


                X1 (11 mg kg^-1)   X2 (25 mg kg^-1)
One factor      6.35               22.01
Two factors     10.93              25.98
Three factors   11.18              25.92
Figure 24 True vs. predicted tryptophan concentration using a two-factor partial least
squares regression model


As with PCR, a regression model built from two orthogonal new variables 
serves to provide good predictive ability. 

Regression analysis is probably the most popular technique in statistics and
data analysis, and commercial software packages will usually provide for
multiple linear regression with residuals analysis and variables subset selection.
The efficacy of the least squares method is susceptible to outliers, and graphic
display of the data is recommended to allow detection of such data. In an
attempt to overcome many of the problems associated with ordinary least
squares regression, several other calibration and prediction models have been
developed and applied. As well as principal components regression and partial
least squares regression, ridge regression should be noted. Although PCR has
been extensively applied in chemometrics it is seldom recommended by statis-
ticians. Ridge regression, on the other hand, is well known and often advocated
amongst statisticians but has received little attention in chemometrics. The
method artificially reduces the correlation amongst variates by modifying the
correlation matrix in a well defined but empirical manner. Details of the
method can be found in Afifi and Clark.⁶ To date there have been relatively few
direct comparisons of the various multivariate regression techniques, although
Frank and Friedman¹⁴ and Wold¹⁵ have published a theoretical, statistics based
comparison which is recommended to interested readers.
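As a rough illustration of how ridge regression modifies the correlation matrix, the sketch below adds a small constant k to the diagonal of R before solving the standardized regression equation R.B = r. This is the usual textbook form of the method and is assumed here; the text itself does not give the formula, and the function name ridge_standardized and the data are hypothetical.

```python
import numpy as np

def ridge_standardized(X, y, k):
    """Standard regression coefficients from (R + k.I).B = r."""
    n = len(y)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardized independent variables
    z = (y - y.mean()) / y.std(ddof=1)                 # standardized dependent variable
    R = Z.T @ Z / (n - 1)        # correlation matrix of the independent variables
    r = Z.T @ z / (n - 1)        # correlations with the dependent variable
    return np.linalg.solve(R + k * np.eye(X.shape[1]), r)

# Two almost colinear synthetic variables
rng = np.random.default_rng(1)
x1 = rng.normal(size=30)
X = np.column_stack([x1, x1 + 0.05 * rng.normal(size=30)])
y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=30)

B_ols = ridge_standardized(X, y, k=0.0)     # k = 0 is ordinary least squares
B_ridge = ridge_standardized(X, y, k=0.1)   # a small, empirically chosen k
print(B_ols, B_ridge)
```

Increasing k shrinks the coefficient vector and stabilizes it against colinearity, at the cost of some bias; the value of k is chosen empirically.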


14 I.E. Frank and J.H. Friedman, Technometrics, 1993, 35, 109.
15 S. Wold, Technometrics, 1993, 35, 136. 
A.1


Chemometrics is predominantly concerned with multivariate analysis. With
any sample we will make many, sometimes hundreds, of measurements in order
to characterize the sample. In optical spectrochemical applications these
measures are likely to comprise absorbance, transmission, or reflection metrics
made at discrete wavelengths in a spectral region. In order to handle and
manipulate such large sets of data, the use of matrix representation is not only
inevitable but also desirable.

A matrix is a two-way table of values usually arranged so that each row
represents a distinct sample or object and each column contains metric values
describing the samples. Table 1(a) shows a small data matrix of 10 samples, the
percent transmission values of which are recorded at three wavelengths. Table
1(b) is the matrix of correlations between the wavelength measures. This is a
square matrix (the number of rows is the same as the number of columns) and it
is symmetric about the main diagonal. The matrix in Table 1(c) of the mean
transmission values has only one row and is referred to as a row vector. This
vector can be thought of in geometric terms as representing a point in three-
dimensional space defined by the three wavelength axes, as shown in Figure 1.

Matrix operations enable us to manipulate arrays of data as single entities 
without detailing each operation on each individual value or element contained 
within the matrix. To distinguish a matrix from ordinary single numbers, or 
scalars, the name of the matrix is usually printed in bold face, with capital 
letters signifying a full matrix and lower-case letters representing vectors or 
one-dimensional matrices. 

Thus if we elect to denote the data matrix from Table 1(a) as A, each row
as a vector r, and each column as a vector c, then,


    | r1 |
A = | r2 | = [c1 c2 ... cn] (1)
    | .. |
    | rm |
Table 1 The percent transmission at three wavelengths of 10 solutions (a), the
correlation matrix of transmission values (b), and the mean transmission
values as a row vector (c)

(a)
Sample   λ1   λ2   λ3
         % Transmission
1        82   58   54
2        76   76   51
3        58   25   87
4        64   54   56
5        25   32   35
6        32   36   54
7        45   54   22
8        56   17   83
9        58   59   62
10       47   65   45

(b)
 1.00   0.61  -0.10   0.95  -0.99  -0.74   0.37  -0.03  -0.78  -0.30
 0.61   1.00  -0.85   0.33  -0.73  -0.99   0.96  -0.81  -0.97   0.58
-0.10  -0.85   1.00   0.23   0.26   0.74  -0.96   1.00   0.69  -0.92
 0.95   0.33   0.23   1.00  -0.88  -0.48   0.06   0.29  -0.54  -0.58
-0.99  -0.73   0.26  -0.88   1.00   0.84  -0.52   0.19   0.87   0.14
-0.74  -0.99   0.74  -0.48   0.84   1.00  -0.90   0.70   1.00  -0.43
 0.37   0.96  -0.96   0.06  -0.52  -0.90   1.00  -0.94  -0.87   0.78
-0.03  -0.81   1.00   0.29   0.19   0.70  -0.94   1.00   0.64  -0.95
-0.78  -0.97   0.69  -0.54   0.87   1.00  -0.87   0.64   1.00  -0.36
-0.30   0.58  -0.92  -0.58   0.14  -0.43   0.78  -0.95  -0.36   1.00

(c)
r_mean = [54.3 47.6 54.9]

Figure 1 Vector of means as a point in three-dimensional space
Table 2 Some identity matrices

I2 = | 1 0 |      I3 = | 1 0 0 |
     | 0 1 |           | 0 1 0 |
                       | 0 0 1 |


where m is the number of rows and n is the number of columns. Each individual
element of the matrix is usually written as a_ij (i = 1 ... m, j = 1 ... n). If n = m
then the matrix is square and if a_ij = a_ji it is symmetric.

A matrix with all elements equal to zero except those on the main diagonal is
called a diagonal matrix. An important diagonal matrix commonly encountered
in matrix operations is the unit matrix, or identity matrix, denoted I, in which all
the diagonal elements have the value 1, Table 2.


A.2 Simple Matrix Operations 


If two matrices, A and B, are said to be equal, then they both must be of the
same dimensions, i.e. have the same number of rows and columns, and their
corresponding elements must be equal. Thus the statement A = B provides a
shorthand notation for stating a_ij = b_ij for all i and all j.

The addition of matrices can only be defined when they are the same size, the
result being achieved simply by summing corresponding elements, i.e.


C = A + B
or,
c_ij = a_ij + b_ij, for all i and j (2)


Subtraction of matrices is defined in a similar way. 

When a matrix is rotated such that the columns become rows, and the rows
become columns, then the result is the transpose of the matrix. This is usually
represented as A^T. If B = A^T then,


b_ij = a_ji for all i and j (3)


In a similar fashion, the transpose of a row vector is a column vector, and vice 
versa. Note that a symmetric matrix is equal to its transpose. 

Matrix operations with scalar quantities are straightforward. To multiply the
matrix A by the scalar number k implies multiplying each element of A by k.


kA = k.a_ij, for all i and j (4)


Division, addition, subtraction, and other operations involving scalar values 
are carried out in the same way, element by element. The transmission matrix 
of Table 1(a) can be converted to the corresponding matrix of absorbance 
values by application of the well known Beer's Law relationship, 

$$ c_{ij} = \log \frac{1}{a_{ij}} \tag{5} $$

with the resultant matrix C given in Table 3. 

Table 3 Solution absorbance values at three wavelengths (from Table 1(a)) 

Sample        Absorbance 
          λ₁      λ₂      λ₃ 

 1       0.09    0.24    0.27 
 2       0.12    0.12    0.29 
 3       0.24    0.60    0.06 
 4       0.19    0.27    0.25 
 5       0.60    0.49    0.46 
 6       0.49    0.44    0.27 
 7       0.35    0.27    0.66 
 8       0.25    0.77    0.08 
 9       0.24    0.23    0.21 
10       0.33    0.19    0.35 
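
The scalar operation of Equation (4) and the element-wise conversion of 
Equation (5) might be sketched as follows. The transmittance values are 
hypothetical (Table 1(a) is not reproduced here), and the sketch assumes 
transmittance expressed as a fraction between 0 and 1 and a base-10 logarithm. 

```python
import math


def scale(A, k):
    """Multiply every element of A by the scalar k (Equation 4)."""
    return [[k * a for a in row] for row in A]


def absorbance(T):
    """Apply c_ij = log10(1 / a_ij) to every element (Equation 5).

    Assumes each transmittance is a fraction in the range (0, 1].
    """
    return [[math.log10(1.0 / t) for t in row] for row in T]


# Hypothetical transmittance matrix, for illustration only.
T = [[0.50, 0.10],
     [1.00, 0.25]]

C = absorbance(T)                     # element-wise absorbance values
doubled = scale([[1, 2], [3, 4]], 2)  # every element multiplied by 2
```

A transmittance of 0.10 corresponds to an absorbance of 1, and a fully 
transmitting sample (transmittance 1.00) to an absorbance of 0. 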


A.3 Matrix Multiplication 


The amino acids tryptophan and tyrosine exhibit characteristic UV spectra in 
alkaline solution and each may be determined in the presence of the other by 
solving a simple pair of simultaneous equations. 

$$ \begin{aligned} A_{m,300} &= A_{Tr,300} + A_{Ty,300} = \varepsilon_{Tr,300} \, c_{Tr} + \varepsilon_{Ty,300} \, c_{Ty} \\ A_{m,200} &= A_{Tr,200} + A_{Ty,200} = \varepsilon_{Tr,200} \, c_{Tr} + \varepsilon_{Ty,200} \, c_{Ty} \end{aligned} \tag{6} $$

In dilute solution, the total absorbance at 300 nm of the mixture, $A_{m,300}$, is 
equal to the sum of the absorbance from tryptophan, $A_{Tr,300}$, and tyrosine, 
$A_{Ty,300}$. These quantities in turn are dependent on the absorption coefficients of 
the two species, $\varepsilon_{Tr}$ and $\varepsilon_{Ty}$, and their respective concentrations, $c_{Tr}$ and $c_{Ty}$. 

Equation (6) can be expressed in matrix notation as 

$$ A = \varepsilon C \tag{7} $$

where A is the matrix of absorbance values for the mixtures, ε the matrix of 
absorption coefficients, and C the matrix of concentrations. The right-hand 
side of Equation (7) involves the multiplication of two matrices, and the 
equation can be written as 

$$ \begin{bmatrix} A_{m,300} \\ A_{m,200} \end{bmatrix} = \begin{bmatrix} \varepsilon_{Tr,300} & \varepsilon_{Ty,300} \\ \varepsilon_{Tr,200} & \varepsilon_{Ty,200} \end{bmatrix} \begin{bmatrix} c_{Tr} \\ c_{Ty} \end{bmatrix} \tag{8} $$

The 2 × 1 matrix of concentrations, pre-multiplied by the 2 × 2 matrix of 
absorption coefficients, results in a 2 × 1 matrix of mixture absorbance values. 
The general rule for matrix multiplication is, if A is a matrix of m rows and n 
columns and B is of n rows and p columns then the product A.B is a matrix, C, 
of m rows and p columns: 

$$ c_{ij} = a_{i1} b_{1j} + a_{i2} b_{2j} + \ldots + a_{in} b_{nj} \tag{9} $$


This product is only defined if B has the same number of rows as A has 
columns. Although A.B may be defined, B.A may not be defined at all. Even 
when A.B and B.A are possible, they will in general be different, i.e. matrix 
multiplication is non-commutative. If A is a 3 × 2 matrix and B is a 2 × 3, then 
A.B is 3 × 3 but B.A is 2 × 2. 
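
A minimal sketch of the multiplication rule of Equation (9) is given below. The 
function name `matmul` and the example matrices are assumptions introduced 
for illustration; they show that a 3 × 2 and a 2 × 3 matrix give products of 
different sizes depending on the order of multiplication. 

```python
def matmul(A, B):
    """Return C = A.B, where c_ij = a_i1*b_1j + ... + a_in*b_nj (Equation 9)."""
    n = len(B)  # number of rows of B
    # The product is only defined if A has as many columns as B has rows.
    if any(len(row) != n for row in A):
        raise ValueError("columns of A must equal rows of B")
    p = len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


# Hypothetical matrices: A is 3 x 2 and B is 2 x 3.
A = [[1, 2],
     [3, 4],
     [5, 6]]
B = [[1, 0, 2],
     [0, 1, 3]]

AB = matmul(A, B)   # 3 x 3 result
BA = matmul(B, A)   # 2 x 2 result, so A.B and B.A cannot be equal
```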

The effects of pre-multiplying and post-multiplying by a diagonal matrix are 
of special interest. Suppose A and W are both m × m matrices and W is 
diagonal. Then the product A.W is also an m × m matrix formed by multiplying 
each column of A by the corresponding diagonal element of W. W.A is also 
m × m but now its rows are multiples of the rows of A. In Table 4(a) the 
elements of the diagonal matrix W are the reciprocals of the maximum 
absorbances recorded at each of the three wavelengths over the 10 samples of 
Table 3. Here A is the 10 × 3 absorbance matrix and W is 3 × 3, so it is the 
post-multiplied product A.W, shown in Table 4(b), that rescales each column. 
The data are thereby normalized such that the maximum absorbance at each 
wavelength is unity. 


Table 4 The diagonal matrix of weights for normalizing the absorbance data (a) 
and the normalized absorbance data matrix (b) 

(a) 

$$ W = \begin{bmatrix} 0.60^{-1} & 0 & 0 \\ 0 & 0.77^{-1} & 0 \\ 0 & 0 & 0.66^{-1} \end{bmatrix} $$

(b) 

Sample        Absorbance 
          λ₁      λ₂      λ₃ 

 1       0.14    0.31    0.41 
 2       0.20    0.15    0.44 
 3       0.39    0.78    0.09 
 4       0.32    0.35    0.38 
 5       1.00    0.64    0.69 
 6       0.82    0.58    0.41 
 7       0.58    0.35    1.00 
 8       0.42    1.00    0.12 
 9       0.39    0.30    0.32 
10       0.54    0.24    0.53 
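
The normalization by a diagonal matrix can be checked numerically with the 
absorbance values of Table 3. The sketch below assumes the weights are the 
reciprocals of the column maxima and that the data matrix is post-multiplied 
by W; the helper name `matmul` is introduced here for illustration. Because 
Table 3 is rounded to two decimal places, the results may differ from the 
tabulated normalized values in the last digit. 

```python
# Absorbance data from Table 3: 10 samples (rows) at 3 wavelengths (columns).
A = [
    [0.09, 0.24, 0.27],
    [0.12, 0.12, 0.29],
    [0.24, 0.60, 0.06],
    [0.19, 0.27, 0.25],
    [0.60, 0.49, 0.46],
    [0.49, 0.44, 0.27],
    [0.35, 0.27, 0.66],
    [0.25, 0.77, 0.08],
    [0.24, 0.23, 0.21],
    [0.33, 0.19, 0.35],
]


def matmul(A, B):
    """Plain matrix product C = A.B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Maximum absorbance at each wavelength (the maximum of each column).
col_max = [max(col) for col in zip(*A)]

# Diagonal weight matrix holding the reciprocals of the column maxima.
W = [[1.0 / col_max[i] if i == j else 0.0 for j in range(3)]
     for i in range(3)]

# Post-multiplying by W scales each column of A by one diagonal element.
normalized = matmul(A, W)
normalized_max = [max(col) for col in zip(*normalized)]
```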
A.4 Sums of Squares and Products 


The product obtained by pre-multiplying a column vector by its transpose is a 
single value, the sum of the squares of the elements. 

$$ x^{T}.x = \sum_i x_i^{2} \tag{10} $$


Geometrically, if the elements of x represent the coordinates of a point, then 
($x^{T}.x$) is the squared distance of the point from the origin, Figure 2. 
If y is a second column vector, of the same size as x, then 

$$ x^{T}.y = y^{T}.x = \sum_i x_i y_i \tag{11} $$

and the result represents the sum of the products of the elements of x and y. 
Dividing this sum by the lengths of the two vectors gives 

$$ \frac{x^{T}.y}{(x^{T}.x)^{1/2} \, (y^{T}.y)^{1/2}} = \cos\theta \tag{12} $$

where θ is the angle between the lines connecting the two points defined by each 
vector and the origin, Figure 3. 

If $x^{T}.y = 0$ then, from Equation (12), the two vectors are at right angles to 
each other and are said to be orthogonal. 
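
Equations (10) to (12) can be illustrated with a short calculation. The vectors 
x = [2, 3] and y = [4, 3] used below are chosen for illustration (they are 
assumed to match the points plotted in Figures 2 and 3), and the function name 
`dot` is likewise an assumption of this sketch. 

```python
import math


def dot(x, y):
    """Sum of products of corresponding elements, x^T.y (Equation 11)."""
    return sum(a * b for a, b in zip(x, y))


x = [2, 3]
y = [4, 3]

sum_of_squares = dot(x, x)    # squared distance of x from the origin (Eq. 10)
sum_of_products = dot(x, y)   # x^T.y, identical to y^T.x (Eq. 11)

# Cosine of the angle between the two vectors (Equation 12).
cos_theta = sum_of_products / math.sqrt(dot(x, x) * dot(y, y))
theta_degrees = math.degrees(math.acos(cos_theta))

# Two vectors are orthogonal when their sum of products is zero.
orthogonal = dot([1, 0], [0, 1]) == 0
```

For these vectors the sum of squares of x is 13, the sum of products is 17, and 
the angle between the vectors is a little under 20 degrees. 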

Figure 2 Sum of squares as a point in space 
(point plotted: $x^{T} = [2, 3]$) 

Figure 3 The angle between two vectors (see text) 
(points plotted: $x^{T} = [2, 3]$ and $y^{T} = [4, 3]$) 

Sums of squares and products are basic operations in statistics and 
chemometrics. For a data matrix represented by X, the matrix of sums of squares 
and products is simply $X^{T}.X$. This can be extended to produce a weighted sums of 
squares and products matrix, C: 

$$ C = X^{T}.W.X \tag{13} $$

where W is a diagonal matrix, the diagonal elements of which are the weights 
for each sample. 

These operations have been employed extensively throughout the text; see, 
for example, the calculation of covariance and correlation about the mean and 
the origin developed in Chapter 3. 
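
As a sketch of Equation (13), the weighted sums of squares and products 
matrix can be accumulated directly from the rows of the data matrix. The 
function name `weighted_ssp`, the data matrix and the weights are all 
hypothetical; setting every weight to 1 recovers the unweighted matrix. 

```python
def weighted_ssp(X, w):
    """Return C = X^T.W.X, where W is diagonal with the weights w (Equation 13).

    X holds one sample per row; w holds one weight per sample.
    """
    if len(w) != len(X):
        raise ValueError("one weight is required per sample (row of X)")
    p = len(X[0])
    # Element (j, k) is the weighted sum over samples of x_ij * x_ik.
    return [[sum(wi * row[j] * row[k] for wi, row in zip(w, X))
             for k in range(p)]
            for j in range(p)]


# Hypothetical data: 3 samples (rows) measured on 2 variables (columns).
X = [[1, 2],
     [3, 4],
     [5, 6]]

unweighted = weighted_ssp(X, [1, 1, 1])   # equal to X^T.X
weighted = weighted_ssp(X, [1, 0, 2])     # second sample ignored, third doubled
```

Because W is diagonal, the triple product never needs to be formed explicitly; 
each sample simply contributes its own products multiplied by its weight. 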


A.5 Inverse of a Matrix 


The division of one scalar value by another can be represented by the product 
of the first number and the inverse, or reciprocal, of the second. Matrix division 
is accomplished in a similar fashion, with the inverse of matrix A represented by 
$A^{-1}$. Just as the product of a scalar quantity and its inverse is unity, so the 
product of a square matrix and its inverse is the unit matrix of equivalent size, 
i.e. 

$$ A.A^{-1} = A^{-1}.A = I \tag{14} $$


The multivariate inverse proves useful in many chemometric algorithms, 
including the solution of simultaneous equations. In Equations (6) and (7) a pair 
of simultaneous equations was presented and expressed in matrix notation, 
illustrating the multivariate form of Beer's Law. Assuming the mixture 
absorbances were recorded, and the values for the absorption coefficients 
obtained from tables or measured from dilute solutions of the pure components, 
then rearranging Equation (7) leads to 

$$ \varepsilon^{-1}.A = C \tag{15} $$

from which the concentration vector can be calculated. 

In general, a square matrix can only be inverted if its columns are linearly 
independent. If this is not the case, for example when one column is some 
multiple of another, then the matrix cannot be inverted and it is said to be 
singular. A matrix that is close to singular is described as ill-conditioned. 

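The manual calculation of the inverse of a matrix can be illustrated with a 
2 × 2 matrix. For larger matrices the procedure is tedious, the amount of work 
increasing as the cube of the size of the matrices. If 

$$ A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \quad \text{and} \quad A^{-1} = \begin{bmatrix} p & q \\ r & s \end{bmatrix} \tag{16} $$

then from Equation (14) we require 

$$ \begin{bmatrix} p & q \\ r & s \end{bmatrix} . \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} $$

i.e. 

$$ \begin{bmatrix} pa + qc & pb + qd \\ ra + sc & rb + sd \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \tag{17} $$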
Therefore, 

$$ \begin{aligned} pa + qc &= 1 \\ pb + qd &= 0 \\ ra + sc &= 0 \\ rb + sd &= 1 \end{aligned} \tag{18} $$

Multiplying the first equation by d and the second by c, 

$$ pad + qcd = d, \quad \text{and} \quad pbc + qdc = 0 $$

and subtracting, 

$$ p(ad - bc) = d $$

or, 

$$ p = d/(ad - bc) = d/k \tag{19} $$

where k = (ad - bc). 
From the second equation we have 

$$ q = -pb/d = -db/kd = -b/k \tag{20} $$

Similarly from the third and fourth equations, 

$$ r = -c/k \quad \text{and} \quad s = a/k \tag{21} $$


Thus the inverse matrix is given by 

$$ A^{-1} = \frac{1}{k} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \tag{22} $$

The quantity k is referred to as the determinant of the matrix A, written |A|, 
and for the inverse to exist |A| must not be zero. The matrix 
$\begin{bmatrix} 1 & 2 \\ 3 & 6 \end{bmatrix}$ has no inverse since 
(1 × 6) - (2 × 3) = 0; the columns are linearly dependent and the 
determinant is zero. 
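
The result of Equation (22) translates directly into code. The following is a 
minimal sketch; the function name `inverse_2x2` and the test matrix are 
assumptions made for illustration, and the exact comparison of the determinant 
with zero is adequate only for a simple example such as this. 

```python
def inverse_2x2(A):
    """Invert a 2 x 2 matrix [[a, b], [c, d]] using Equation (22)."""
    (a, b), (c, d) = A
    k = a * d - b * c   # the determinant |A|
    if k == 0:
        # Linearly dependent columns: no inverse exists.
        raise ValueError("matrix is singular (determinant is zero)")
    return [[d / k, -b / k],
            [-c / k, a / k]]


# Hypothetical invertible matrix: determinant = 4*6 - 7*2 = 10.
A = [[4, 7],
     [2, 6]]
A_inv = inverse_2x2(A)

# The singular matrix discussed in the text has determinant 1*6 - 2*3 = 0.
try:
    inverse_2x2([[1, 2], [3, 6]])
    singular_detected = False
except ValueError:
    singular_detected = True
```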


A.6 Simultaneous Equations 


We are now in a position where we can solve our two-component, two- 
wavelength spectrochemical analysis for tryptophan and tyrosine. 

A 1 mg l$^{-1}$ solution of tryptophan provides an absorbance of 0.4 at 200 nm 
and 0.1 at 300 nm, measured in a 1 cm path cell. The corresponding absorbance 
values, under identical conditions, for tyrosine are 0.1 and 0.3, and for a 
mixture, 0.63 and 0.57. What are the concentrations of tryptophan and tyrosine 
in the mixture? 

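From the experimental data, 

$$ A_m = \begin{bmatrix} 0.63 \\ 0.57 \end{bmatrix}, \qquad \varepsilon = \begin{bmatrix} 0.4 & 0.1 \\ 0.1 & 0.3 \end{bmatrix} \tag{23} $$

Using Equation (22), 

$$ |\varepsilon| = (0.12 - 0.01) = 0.11 \tag{24} $$

and 

$$ \varepsilon^{-1} = \begin{bmatrix} 0.3/0.11 & -0.1/0.11 \\ -0.1/0.11 & 0.4/0.11 \end{bmatrix} \tag{25} $$

$$ \begin{aligned} C_m &= \varepsilon^{-1}.A_m \\ &= [(1.72 - 0.52), \ (-0.57 + 2.07)] \\ &= [1.2, \ 1.5] \end{aligned} \tag{26} $$

In the mixture there are 1.2 mg l$^{-1}$ of tryptophan and 1.5 mg l$^{-1}$ of tyrosine. 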
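
The worked example can be reproduced numerically. The sketch below uses the 
absorbance data quoted above and the 2 × 2 inverse of Equation (22); the 
variable names are assumptions introduced for illustration. 

```python
# Absorption coefficients: rows are 200 nm and 300 nm,
# columns are tryptophan and tyrosine (absorbance per mg/l in a 1 cm cell).
eps = [[0.4, 0.1],
       [0.1, 0.3]]

# Mixture absorbances at 200 nm and 300 nm.
A_m = [0.63, 0.57]

# Determinant and inverse of the 2 x 2 coefficient matrix (Equation 22).
(a, b), (c, d) = eps
k = a * d - b * c
eps_inv = [[d / k, -b / k],
           [-c / k, a / k]]

# Concentrations: C_m = eps^-1 . A_m (Equation 26).
conc = [eps_inv[0][0] * A_m[0] + eps_inv[0][1] * A_m[1],
        eps_inv[1][0] * A_m[0] + eps_inv[1][1] * A_m[1]]

tryptophan, tyrosine = conc

# Check: substituting the concentrations back should return the absorbances.
back_calculated = [a * tryptophan + b * tyrosine,
                   c * tryptophan + d * tyrosine]
```

Substituting the calculated concentrations back into the original pair of 
equations is a useful check on the arithmetic. 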


A.7 Quadratic Forms 


To this point our discussions have largely focused on the application of 
matrices to linear problems associated with simultaneous equations, 
applications that commonly arise in least-squares, multiple regression 
techniques. One further important function that occurs in multivariate analysis 
and the analysis of variance is the quadratic form. 
The product $x^{T}.A.x$ is a scalar quantity and is referred to as a quadratic 
form of x. In statistics and chemometrics A is generally square and symmetric. 
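If A is a 2 × 2 matrix, 

$$ \begin{aligned} x^{T}.A.x &= \begin{bmatrix} x_1 & x_2 \end{bmatrix} \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ &= \begin{bmatrix} (a_{11} x_1 + a_{21} x_2) & (a_{12} x_1 + a_{22} x_2) \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \\ &= a_{11} x_1^{2} + a_{12} x_1 x_2 + a_{21} x_2 x_1 + a_{22} x_2^{2} \end{aligned} \tag{27} $$

and if $a_{21} = a_{12}$ (A is symmetric), 

$$ x^{T}.A.x = x_1^{2} a_{11} + 2 x_1 x_2 a_{12} + x_2^{2} a_{22} \tag{28} $$

and if A = I, 

$$ x^{T}.A.x = x_1^{2} + x_2^{2} \tag{29} $$

Thus, the quadratic form generally expands to the quadratic equation 
describing an ellipse in two dimensions or an ellipsoid, or hyper-ellipsoid, in 
higher dimensions, as described in Chapter 1. 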
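
The expansions of Equations (27) to (29) can be verified with a small numerical 
example. The function name `quadratic_form`, the vector and the symmetric 
matrix below are hypothetical values chosen only to illustrate the calculation. 

```python
def quadratic_form(x, A):
    """Return the scalar x^T.A.x as the double sum of x_i * a_ij * x_j."""
    n = len(x)
    return sum(x[i] * A[i][j] * x[j] for i in range(n) for j in range(n))


x = [2, 3]

# Hypothetical symmetric matrix (a12 = a21).
A = [[2, 1],
     [1, 3]]

q = quadratic_form(x, A)

# Expansion for a symmetric 2 x 2 matrix (Equation 28).
q_expanded = x[0] ** 2 * A[0][0] + 2 * x[0] * x[1] * A[0][1] + x[1] ** 2 * A[1][1]

# With A equal to the identity matrix the form reduces to the sum of squares
# of the elements of x (Equation 29).
identity = [[1, 0],
            [0, 1]]
q_identity = quadratic_form(x, identity)
```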